% count/count.tex

\QuickQuizChapter{chp:Counting}{Counting}

Counting is perhaps the simplest and most natural thing a computer can do.
However, counting efficiently and scalably on a large
shared-memory multiprocessor can be quite challenging.
Furthermore, the simplicity of the underlying concept of counting
allows us to explore the fundamental issues of concurrency without
the distractions
of elaborate data structures or complex synchronization primitives.
Counting therefore provides an excellent introduction to
parallel programming.

This chapter covers a number of special cases for which there are simple,
fast, and scalable counting algorithms.
But first, let us find out how much you already know about concurrent
counting.

\QuickQuiz{}
	Why on earth should efficient and scalable counting be hard?
	After all, computers have special hardware for the sole purpose
	of doing counting,
	addition, subtraction, and lots more besides, don't they???
\QuickQuizAnswer{
	Because the straightforward counting algorithms, for example,
	atomic operations on a shared counter, either are slow and scale
	badly, or are inaccurate, as will be seen in
	Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}.
} \QuickQuizEnd

\QuickQuiz{}
	{ \bfseries Network-packet counting problem. }
	Suppose that you need to collect statistics on the number
	of networking packets (or total number of bytes) transmitted
	and/or received.
	Packets might be transmitted or received by any CPU on
	the system.
	Suppose further that this large machine is capable of
	handling a million packets per second, and that there
	is a systems-monitoring package that reads out the count
	every five seconds.
	How would you implement this statistical counter?
\QuickQuizAnswer{
	Hint: The act of updating the counter must be blazingly
	fast, but because the counter is read out only about once
	in five million updates, the act of reading out the counter can be
	quite slow.
	In addition, the value read out normally need not be all that
	accurate---after all, since the counter is updated a thousand
	times per millisecond, we should be able to work with a value
	that is within a few thousand counts of the ``true value'',
	whatever ``true value'' might mean in this context.
	However, the value read out should maintain roughly the same
	absolute error over time.
	For example, a 1\% error might be just fine when the count
	is on the order of a million or so, but might be absolutely
	unacceptable once the count reaches a trillion.
	See Section~\ref{sec:count:Statistical Counters}.
} \QuickQuizEnd

\QuickQuiz{}
	{ \bfseries Approximate structure-allocation limit problem. }
	Suppose that you need to maintain a count of the number of
	structures allocated in order to fail any allocations
	once the number of structures in use exceeds a limit
	(say, 10,000).
	Suppose further that these structures are short-lived,
	that the limit is rarely exceeded, and that a ``sloppy''
	approximate limit is acceptable.
\QuickQuizAnswer{
	Hint: The act of updating the counter must again be blazingly
	fast, but the counter is read out each time that the
	counter is increased.
	However, the value read out need not be accurate
	\emph{except} that it must distinguish approximately
	between values below the limit and values greater than or
	equal to the limit.
	See Section~\ref{sec:count:Approximate Limit Counters}.
} \QuickQuizEnd

\QuickQuiz{}
	{ \bfseries Exact structure-allocation limit problem. }
	Suppose that you need to maintain a count of the number of
	structures allocated in order to fail any allocations
	once the number of structures in use exceeds an exact limit
	(again, say 10,000).
	Suppose further that these structures are short-lived,
	and that the limit is rarely exceeded, that there is almost
	always at least one structure in use, and suppose further
	still that it is necessary to know exactly when this counter reaches
	zero, for example, in order to free up some memory
	that is not required unless there is at least one structure
	in use.
\QuickQuizAnswer{
	Hint: The act of updating the counter must once again be blazingly
	fast, but the counter is read out each time that the
	counter is increased.
	However, the value read out need not be accurate
	\emph{except} that it absolutely must distinguish perfectly
	between values between the limit and zero on the one hand,
	and values that either are less than or equal to zero or
	are greater than or equal to the limit on the other hand.
	See Section~\ref{sec:count:Exact Limit Counters}.
} \QuickQuizEnd

\QuickQuiz{}
	{ \bfseries Removable I/O device access-count problem. }
	Suppose that you need to maintain a reference count on a
	heavily used removable mass-storage device, so that you
	can tell the user when it is safe to removed the device.
	This device follows the usual removal procedure where
	the user indicates a desire to remove the device, and
	the system tells the user when it is safe to do so.
\QuickQuizAnswer{
	Hint: Yet again, the act of updating the counter must be blazingly
	fast and scalable in order to avoid slowing down I/O operations,
	but because the counter is read out only when the
	user wishes to remove the device, the counter read-out
	operation can be extremely slow.
	Furthermore, there is no need to be able to read out
	the counter at all unless the user has already indicated
	a desire to remove the device.
	In addition, the value read out need not be accurate
	\emph{except} that it absolutely must distinguish perfectly
	between non-zero and zero values, and even then only when
	the device is in the process of being removed.
	However, once it has read out a zero value, it must act
	to keep the value at zero until it has taken some action
	to prevent subsequent threads from gaining access to the
	device being removed.
	See Section~\ref{sec:count:Applying Specialized Parallel Counters}.
} \QuickQuizEnd

The remainder of this chapter will develop answers to these questions.
Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}
asks why counting on multicore systems isn't trivial, and
Section~\ref{sec:count:Statistical Counters}
looks into ways of solving the network-packet counting problem.
Section~\ref{sec:count:Approximate Limit Counters}
investigates the approximate structure-allocation limit problem, while
Section~\ref{sec:count:Exact Limit Counters}
takes on the exact structure-allocation limit problem.
Section~\ref{sec:count:Applying Specialized Parallel Counters}
discusses how to use the various specialized parallel counters
introduced in the preceding sections.
Finally, Section~\ref{sec:count:Parallel Counting Discussion}
concludes the chapter with performance measurements.

Sections~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}
and~\ref{sec:count:Statistical Counters}
contain introductory material, while the remaining sections
are more appropriate for advanced students.

\section{Why Isn't Concurrent Counting Trivial?}
\label{sec:count:Why Isn't Concurrent Counting Trivial?}

\begin{figure}[bp]
{ \scriptsize
\begin{verbatim}
  1 long counter = 0;
  2 
  3 void inc_count(void)
  4 {
  5   counter++;
  6 }
  7 
  8 long read_count(void)
  9 {
 10   return counter;
 11 }
\end{verbatim}
}
\caption{Just Count!}
\label{fig:count:Just Count!}
\end{figure}

Let's start with something simple, for example, the straightforward
use of arithmetic shown in
Figure~\ref{fig:count:Just Count!} (\url{count_nonatomic.c}).
Here, we have a counter on line~1, we increment it on line~5, and we
read out its value on line~10.
What could be simpler?

This approach has the additional advantage of being blazingly fast if
you are doing lots of reading and almost no incrementing, and on small
systems, the performance is excellent.

There is just one large fly in the ointment: this approach can lose
counts.
On my dual-core laptop, a short run invoked \co{inc_count()}
100,014,000 times, but the final value of the counter was only
52,909,118.
Although approximate values do have their place in computing,
accuracies far greater than 50\% are almost always necessary.

\QuickQuiz{}
	But doesn't the \co{++} operator produce an x86 add-to-memory
	instruction?
	And won't the CPU cache cause this to be atomic?
\QuickQuizAnswer{
	Although the \co{++} operator \emph{could} be atomic, there
	is no requirement that it be so.
	And indeed, \co{gcc} often
	chooses to load the value to a register, increment
	the register, then store the value to memory, which is
	decidedly non-atomic.
} \QuickQuizEnd

\QuickQuiz{}
	The 8-figure accuracy on the number of failures indicates
	that you really did test this.
	Why would it be necessary to test such a trivial program,
	especially when the bug is easily seen by inspection?
\QuickQuizAnswer{
	Not only are there very few
	trivial parallel programs, and most days I am
	not so sure that there are many trivial sequential programs, either.

	No matter how small or simple the program, if you haven't tested
	it, it does not work.
	And even if you have tested it, Murphy's Law says that there will
	be at least a few bugs still lurking.

	Furthermore, while proofs of correctness certainly do have their
	place, they never will replace testing, including the
	\url{counttorture.h} test setup used here.
	After all, proofs are only as good as the assumptions that they
	are based on.
	Furthermore, proofs can have bugs just as easily as programs can!
} \QuickQuizEnd

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 atomic_t counter = ATOMIC_INIT(0);
  2 
  3 void inc_count(void)
  4 {
  5   atomic_inc(&counter);
  6 }
  7 
  8 long read_count(void)
  9 {
 10   return atomic_read(&counter);
 11 }
\end{verbatim}
}
\caption{Just Count Atomically!}
\label{fig:count:Just Count Atomically!}
\end{figure}

\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{CodeSamples/count/atomic}}
\end{center}
\caption{Atomic Increment Scalability on Nehalem}
\label{fig:count:Atomic Increment Scalability on Nehalem}
\end{figure}

The straightforward way to count accurately is to use atomic operations,
as shown in
Figure~\ref{fig:count:Just Count Atomically!} (\url{count_atomic.c}).
Line~1 defines an atomic variable, line~5 atomically increments it, and
line~10 reads it out.
Because this is atomic, it keeps perfect count.
However, it is slower: on a Intel Core Duo laptop, it is about
six times slower than non-atomic increment
when a single thread is incrementing, and more than \emph{ten times}
slower if two threads are incrementing.\footnote{
	Interestingly enough, a pair of threads non-atomically incrementing
	a counter will cause the counter to increase more quickly than
	a pair of threads atomically incrementing the counter.
	Of course, if your only goal is to make the counter increase
	quickly, an easier approach is to simply assign a large value
	to the counter.
	Nevertheless, there is likely to be a role for algorithms that
	use carefully relaxed notions of correctness in order to gain
	greater performance and
	scalability~\cite{Andrews91textbook,Arcangeli03,DavidUngar2011unsync}.}

This poor performance should not be a surprise, given the discussion in
Chapter~\ref{chp:Hardware and its Habits},
nor should it be a surprise that the performance of atomic increment
gets slower as the number of CPUs and threads increase, as shown in
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}.
In this figure, the horizontal dashed line resting on the x~axis
is the ideal performance that would be achieved
by a perfectly scalable algorithm: with such an algorithm, a given
increment would incur the same overhead that it would in a single-threaded
program.
Atomic increment of a single global variable is clearly
decidedly non-ideal, and gets worse as you add CPUs.

\QuickQuiz{}
	Why doesn't the dashed line on the x~axis meet the 
	diagonal line at $x=1$?
\QuickQuizAnswer{
	Because of the overhead of the atomic operation.
	The dashed line on the x~axis represents the overhead of
	a single \emph{non-atomic} increment.
	After all, an \emph{ideal} algorithm would not only scale
	linearly, it would also incur no performance penalty compared
	to single-threaded code.

	This level of idealism may seem severe, but if it is good
	enough for Linus Torvalds, it is good enough for you.
} \QuickQuizEnd

\QuickQuiz{}
	But atomic increment is still pretty fast.
	And incrementing a single variable in a tight loop sounds
	pretty unrealistic to me, after all, most of the program's
	execution should be devoted to actually doing work, not accounting
	for the work it has done!
	Why should I care about making this go faster?
\QuickQuizAnswer{
	In many cases, atomic increment will in fact be fast enough
	for you.
	In those cases, you should by all means use atomic increment.
	That said, there are many real-world situations where
	more elaborate counting algorithms are required.
	The canonical example of such a situation is counting packets
	and bytes in highly optimized networking stacks, where it is
	all too easy to find much of the execution time going into
	these sorts of accounting tasks, especially on large
	multiprocessors.

	In addition, as noted at the beginning of this chapter,
	counting provides an excellent view of the
	issues encountered in shared-memory parallel programs.
} \QuickQuizEnd

\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{count/GlobalInc}}
\end{center}
\caption{Data Flow For Global Atomic Increment}
\label{fig:count:Data Flow For Global Atomic Increment}
\end{figure}

\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{cartoons/One-one-thousand}}
\end{center}
\caption{Waiting to Count}
\ContributedBy{Figure}{fig:count:Waiting to Count}{Melissa Broussard}
\end{figure}

For another perspective on global atomic increment, consider
Figure~\ref{fig:count:Data Flow For Global Atomic Increment}.
In order for each CPU to get a chance to increment a given
global variable, the cache line containing that variable must
circulate among all the CPUs, as shown by the red arrows.
Such circulation will take significant time, resulting in
the poor performance seen in
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem},
which might be thought of as shown in
Figure~\ref{fig:count:Waiting to Count}.

The following sections discuss high-performance counting, which
avoids the delays inherent in such circulation.

\QuickQuiz{}
	But why can't CPU designers simply ship the addition operation to the
	data, avoiding the need to circulate the cache line containing
	the global variable being incremented?
\QuickQuizAnswer{
	It might well be possible to do this in some cases.
	However, there are a few complications:
	\begin{enumerate}
	\item	If the value of the variable is required, then the
		thread will be forced to wait for the operation
		to be shipped to the data, and then for the result
		to be shipped back.
	\item	If the atomic increment must be ordered with respect
		to prior and/or subsequent operations, then the thread
		will be forced to wait for the operation to be shipped
		to the data, and for an indication that the operation
		completed to be shipped back.
	\item	Shipping operations among CPUs will likely require
		more lines in the system interconnect, which will consume
		more die area and more electrical power.
	\end{enumerate}
	But what if neither of the first two conditions holds?
	Then you should think carefully about the algorithms discussed
	in Section~\ref{sec:count:Statistical Counters}, which achieve
	near-ideal performance on commodity hardware.

\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{count/GlobalTreeInc}}
\end{center}
\caption{Data Flow For Global Combining-Tree Atomic Increment}
\label{fig:count:Data Flow For Global Combining-Tree Atomic Increment}
\end{figure}

	If either or both of the first two conditions hold, there is
	\emph{some} hope for improved hardware.
	One could imagine the hardware implementing a combining tree,
	so that the increment requests from multiple CPUs are combined
	by the hardware into a single addition when the combined request
	reaches the hardware.
	The hardware could also apply an order to the requests, thus
	returning to each CPU the return value corresponding to its
	particular atomic increment.
	This results in instruction latency that varies as $O(log N)$,
	where $N$ is the number of CPUs, as shown in
	Figure~\ref{fig:count:Data Flow For Global Combining-Tree Atomic Increment}.
	And CPUs with this sort of hardware optimization are starting to
	appear as of 2011.

	This is a great improvement over the $O(N)$ performance
	of current hardware shown in
	Figure~\ref{fig:count:Data Flow For Global Atomic Increment},
	and it is possible that hardware latencies might decrease
	further if innovations such as three-dimensional fabrication prove
	practical.
	Nevertheless, we will see that in some important special cases,
	software can do \emph{much} better.
} \QuickQuizEnd

\section{Statistical Counters}
\label{sec:count:Statistical Counters}

This section covers the common special case of statistical counters, where
the count is updated extremely frequently and the value is read out
rarely, if ever.
These will be used to solve the network-packet counting problem
from the Quick Quiz on
page~\pageref{chp:Counting}.

\subsection{Design}

Statistical counting is typically handled by providing a counter per
thread (or CPU, when running in the kernel), so that each thread
updates its own counter.
The aggregate value of the counters is read out by simply summing up
all of the threads' counters,
relying on the commutative and associative properties of addition.
This is an example of the Data Ownership pattern that will be introduced in
Section~\ref{sec:SMPdesign:Data Ownership}.

\QuickQuiz{}
	But doesn't the fact that C's ``integers'' are limited in size
	complicate things?
\QuickQuizAnswer{
	No, because modulo addition is still commutative and associative.
	At least as long as you use unsigned integers.
	Recall that in the C standard, overflow of signed integers results
	in undefined behavior, never mind the fact that machines that
	do anything other than wrap on overflow are quite rare these days.
	Unfortunately, compilers frequently carry out optimizations that
	assume that signed integers will not overflow, so if your code
	allows signed integers to overflow, you can run into trouble
	even on twos-complement hardware.

	That said, one potential source of additional complexity arises
	when attempting to gather (say) a 64-bit sum from 32-bit
	per-thread counters.
	Dealing with this added complexity is left as
	an exercise for the reader, for whom some of the techniques
	introduced later in this chapter could be quite helpful.
} \QuickQuizEnd

\subsection{Array-Based Implementation}
\label{sec:count:Array-Based Implementation}

One way to provide per-thread variables is to allocate an array with
one element per
thread (presumably cache aligned and padded to avoid false sharing).

\QuickQuiz{}
	An array???
	But doesn't that limit the number of threads?
\QuickQuizAnswer{
	It can, and in this toy implementation, it does.
	But it is not that hard to come up with an alternative
	implementation that permits an arbitrary number of threads,
	for example, using the \co{gcc} \co{__thread} facility,
	as shown in
	Section~\ref{sec:count:Per-Thread-Variable-Based Implementation}.
} \QuickQuizEnd

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 DEFINE_PER_THREAD(long, counter);
  2 
  3 void inc_count(void)
  4 {
  5   __get_thread_var(counter)++;
  6 }
  7 
  8 long read_count(void)
  9 {
 10   int t;
 11   long sum = 0;
 12 
 13   for_each_thread(t)
 14     sum += per_thread(counter, t);
 15   return sum;
 16 }
\end{verbatim}
}
\caption{Array-Based Per-Thread Statistical Counters}
\label{fig:count:Array-Based Per-Thread Statistical Counters}
\end{figure}

Such an array can be wrapped into per-thread primitives, as shown in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
(\url{count_stat.c}).
Line~1 defines an array containing a set of per-thread counters of
type \co{long} named, creatively enough, \co{counter}.

Lines~3-6 show a function that increments the counters, using the
\co{__get_thread_var()} primitive to locate the currently running
thread's element of the \co{counter} array.
Because this element is modified only by the corresponding thread,
non-atomic increment suffices.

Lines~8-16 show a function that reads out the aggregate value of the counter,
using the \co{for_each_thread()} primitive to iterate over the list of
currently running threads, and using the \co{per_thread()} primitive
to fetch the specified thread's counter.
Because the hardware can fetch and store a properly aligned \co{long}
atomically, and because gcc is kind enough to make use of this capability,
normal loads suffice, and no special atomic instructions are required.

\QuickQuiz{}
	What other choice does gcc have, anyway???
\QuickQuizAnswer{
	According to the C standard, the effects of fetching a variable
	that might be concurrently modified by some other thread are
	undefined.
	It turns out that the C standard really has no other choice,
	given that C must support (for example) eight-bit architectures
	which are incapable of atomically loading a \co{long}.
	An upcoming version of the C standard aims to fill this gap,
	but until then, we depend on the kindness of the gcc developers.

	Alternatively, use of volatile accesses such as those provided
	by \co{ACCESS_ONCE()}~\cite{JonCorbet2012ACCESS:ONCE}
	can help constrain the compiler, at least
	in cases where the hardware is capable of accessing the value
	with a single memory-reference instruction.
} \QuickQuizEnd

\QuickQuiz{}
	How does the per-thread \co{counter} variable in
	Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
	get initialized?
\QuickQuizAnswer{
	The C standard specifies that the initial value of
	global variables is zero, unless they are explicitly initialized.
	So the initial value of all the instances of \co{counter}
	will be zero.
	Furthermore, in the common case where the user is interested only
	in differences between consecutive reads
	from statistical counters, the initial value is irrelevant.
} \QuickQuizEnd

\QuickQuiz{}
	How is the code in
	Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
	supposed to permit more than one counter?
\QuickQuizAnswer{
	Indeed, this toy example does not support more than one counter.
	Modifying it so that it can provide multiple counters is left
	as an exercise to the reader.
} \QuickQuizEnd

\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{count/PerThreadInc}}
\end{center}
\caption{Data Flow For Per-Thread Increment}
\label{fig:count:Data Flow For Per-Thread Increment}
\end{figure}

This approach scales linearly with increasing number of updater threads
invoking \co{inc_count()}.
As is shown by the green arrows in
Figure~\ref{fig:count:Data Flow For Per-Thread Increment},
the reason for this is that each CPU can make rapid progress incrementing
its thread's variable, without any expensive cross-system communication.
As such, this section solves the network-packet counting problem presented
at the beginning of this chapter.

\QuickQuiz{}
	The read operation takes time to sum up the per-thread values,
	and during that time, the counter could well be changing.
	This means that the value returned by
	\co{read_count()} in
	Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
	will not necessarily be exact.
	Assume that the counter is being incremented at rate
	$r$ counts per unit time, and that \co{read_count()}'s
	execution consumes $\Delta$ units of time.
	What is the expected error in the return value?
\QuickQuizAnswer{
	Let's do worst-case analysis first, followed by a less
	conservative analysis.

	In the worst case, the read operation completes immediately,
	but is then delayed for $\Delta$ time units before returning,
	in which case the worst-case error is simply $r \Delta$.

	This worst-case behavior is rather unlikely, so let us instead
	consider the case where the reads from each of the $N$
	counters is spaced equally over the time period $\Delta$.
	There will be $N+1$ intervals of duration $\frac{\Delta}{N+1}$
	between the $N$ reads.
	The error due to the delay after the read from the last thread's
	counter will be given by $\frac{r \Delta}{N \left( N + 1 \right)}$,
	the second-to-last thread's counter by
	$\frac{2 r \Delta}{N \left( N + 1 \right)}$,
	the third-to-last by
	$\frac{3 r \Delta}{N \left( N + 1 \right)}$,
	and so on.
	The total error is given by the sum of the errors due to the
	reads from each thread's counter, which is:

	\begin{equation}
		\frac{r \Delta}{N \left( N + 1 \right)}
			\sum_{i = 1}^N i
	\end{equation}

	Expressing the summation in closed form yields:

	\begin{equation}
		\frac{r \Delta}{N \left( N + 1 \right)}
			\frac{N \left( N + 1 \right)}{2}
	\end{equation}

	Cancelling yields the intuitively expected result:

	\begin{equation}
		\frac{r \Delta}{2}
	\end{equation}

	It is important to remember that error continues accumulating
	as the caller executes code making use of the count returned
	by the read operation.
	For example, if the caller spends time $t$ executing some
	computation based on the result of the returned count, the
	worst-case error will have increased to $r \left(t \Delta \right)$.

	The expected error will have similarly increased to:

	\begin{equation}
		r \left( \frac{\Delta}{2} + t \right)
	\end{equation}

	Of course, it is sometimes unacceptable for the counter to
	continue incrementing during the read operation.
	Section~\ref{sec:count:Applying Specialized Parallel Counters}
	discusses a way to handle this situation.

	All that aside, in most uses of statistical counters, the
	error in the value returned by \co{read_count()} is
	irrelevant.
	This irrelevance is due to the fact that the time required
	for \co{read_count()} to execute is normally extremely
	small compared to the time interval between successive
	calls to \co{read_count()}.
} \QuickQuizEnd

However, this excellent update-side scalability comes at great read-side
expense for large numbers of threads.
The next section shows one way to reduce read-side expense while
still retaining the update-side scalability.

\subsection{Eventually Consistent Implementation}
\label{sec:count:Eventually Consistent Implementation}

One way to retain update-side scalability while greatly improving
read-side performance is to weaken consistency requirements.
The counting algorithm in the previous section is guaranteed to
return a value between the value that an ideal counter would have
taken on near the beginning of \co{read_count()}'s execution and
that near the end of \co{read_count()}'s execution.
\emph{Eventual consistency}~\cite{WernerVogels:2009:EventuallyConsistent}
provides a weaker
guarantee: in absence of calls to \co{inc_count()}, calls to
\co{read_count()} will eventually return an accurate count.

We exploit eventual consistency by maintaining a global counter.
However, updaters only manipulate their per-thread counters.
A separate thread is provided to transfer counts from the per-thread
counters to the global counter.
Readers simply access the value of the global counter.
If updaters are active, the value used by the readers will be out of
date, however, once updates cease, the global counter will eventually
converge on the true value---hence this approach qualifies as
eventually consistent.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 DEFINE_PER_THREAD(unsigned long, counter);
  2 unsigned long global_count;
  3 int stopflag;
  4 
  5 void inc_count(void)
  6 {
  7   ACCESS_ONCE(__get_thread_var(counter))++;
  8 }
  9 
 10 unsigned long read_count(void)
 11 {
 12   return global_count;
 13 }
 14 
 15 void *eventual(void *arg)
 16 {
 17   int t;
 18   int sum;
 19 
 20   while (stopflag < 3) {
 21     sum = 0;
 22     for_each_thread(t)
 23       sum += per_thread(counter, t);
 24     ACCESS_ONCE(global_count) = sum;
 25     poll(NULL, 0, 1);
 26     if (stopflag) {
 27       smp_mb();
 28       stopflag++;
 29     }
 30   }
 31   return NULL;
 32 }
 33 
 34 void count_init(void)
 35 {
 36   thread_id_t tid;
 37 
 38   if (pthread_create(&tid, NULL, eventual, NULL) != 0) {
 39     perror("count_init:pthread_create");
 40     exit(-1);
 41   }
 42 }
 43 
 44 void count_cleanup(void)
 45 {
 46   stopflag = 1;
 47   while (stopflag < 3)
 48     poll(NULL, 0, 1);
 49   smp_mb();
 50 }
\end{verbatim}
}
\caption{Array-Based Per-Thread Eventually Consistent Counters}
\label{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
\end{figure}

The implementation is shown in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
(\url{count_stat_eventual.c}).
Lines~1-2 show the per-thread variable and the global variable that
track the counter's value, and line three shows \co{stopflag}
which is used to coordinate termination (for the case where we want
to terminate the program with an accurate counter value).
The \co{inc_count()} function shown on lines~5-8 is similar to its
counterpart in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}.
The \co{read_count()} function shown on lines~10-13 simply returns the
value of the \co{global_count} variable.

However, the \co{count_init()} function on lines~34-42
creates the \co{eventual()} thread shown on lines~15-32, which
cycles through all the threads,
summing the per-thread local \co{counter} and storing the
sum to the \co{global_count} variable.
The \co{eventual()} thread waits an arbitrarily chosen one millisecond
between passes.
The \co{count_cleanup()} function on lines~44-50 coordinates termination.

This approach gives extremely fast counter read-out while still
supporting linear counter-update performance.
However, this excellent read-side performance and update-side scalability
comes at the cost of the additional thread running \co{eventual()}.

\QuickQuiz{}
	Why doesn't \co{inc_count()} in
	Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
	need to use atomic instructions?
	After all, we now have multiple threads accessing the per-thread
	counters!
\QuickQuizAnswer{
	Because one of the two threads only reads, and because the
	variable is aligned and machine-sized, non-atomic instructions
	suffice.
	That said, the \co{ACCESS_ONCE()} macro is used to prevent
	compiler optimizations that might otherwise prevent the
	counter updates from becoming visible to
	\co{eventual()}~\cite{JonCorbet2012ACCESS:ONCE}.

	An older version of this algorithm did in fact use atomic
	instructions, kudos to Ersoy Bayramoglu for pointing out that
	they are in fact unnecessary.
	That said, atomic instructions would be needed in cases where
	the per-thread \co{counter} variables were smaller than the
	global \co{global_count}.
	However, note that on a 32-bit system,
	the per-thread \co{counter} variables
	might need to be limited to 32 bits in order to sum them accurately,
	but with a 64-bit \co{global_count} variable to avoid overflow.
	In this case, it is necessary to zero the per-thead
	\co{counter} variables periodically in order to avoid overflow.
	It is extremely important to note that this zeroing cannot
	be delayed too long or overflow of the smaller per-thread
	variables will result.
	This approach therefore imposes real-time requirements on the
	underlying system, and in turn must be used with extreme care.

	In contrast, if all variables are the same size, overflow
	of any variable is harmless because the eventual sum
	will be modulo the word size.
} \QuickQuizEnd

\QuickQuiz{}
	Won't the single global thread in the function \co{eventual()} of
	Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
	be just as severe a bottleneck as a global lock would be?
\QuickQuizAnswer{
	In this case, no.
	What will happen instead is that as the number of threads increases,
	the estimate of the counter
	value returned by \co{read_count()} will become more inaccurate.
} \QuickQuizEnd

\QuickQuiz{}
	Won't the estimate returned by \co{read_count()} in
	Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
	become increasingly
	inaccurate as the number of threads rises?
\QuickQuizAnswer{
	Yes.
	If this proves problematic, one fix is to provide multiple
	\co{eventual()} threads, each covering its own subset of
	the other threads.
	In more extreme cases, a tree-like hierarchy of
	\co{eventual()} threads might be required.
} \QuickQuizEnd

\QuickQuiz{}
	Given that in the eventually-consistent algorithm shown in
	Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
	both reads and updates have extremely low overhead
	and are extremely scalable, why would anyone bother with the
	implementation described in
	Section~\ref{sec:count:Array-Based Implementation},
	given its costly read-side code?
\QuickQuizAnswer{
	The thread executing \co{eventual()} consumes CPU time.
	As more of these eventually-consistent counters are added,
	the resulting \co{eventual()} threads will eventually
	consume all available CPUs.
	This implementation therefore suffers a different sort of
	scalability limitation, with the scalability limit being in
	terms of the number of eventually consistent counters rather
	than in terms of the number of threads or CPUs.
} \QuickQuizEnd

\subsection{Per-Thread-Variable-Based Implementation}
\label{sec:count:Per-Thread-Variable-Based Implementation}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 long __thread counter = 0;
  2 long *counterp[NR_THREADS] = { NULL };
  3 long finalcount = 0;
  4 DEFINE_SPINLOCK(final_mutex);
  5 
  6 void inc_count(void)
  7 {
  8   counter++;
  9 }
 10 
 11 long read_count(void)
 12 {
 13   int t;
 14   long sum;
 15 
 16   spin_lock(&final_mutex);
 17   sum = finalcount;
 18   for_each_thread(t)
 19     if (counterp[t] != NULL)
 20       sum += *counterp[t];
 21   spin_unlock(&final_mutex);
 22   return sum;
 23 }
 24 
 25 void count_register_thread(void)
 26 {
 27   int idx = smp_thread_id();
 28 
 29   spin_lock(&final_mutex);
 30   counterp[idx] = &counter;
 31   spin_unlock(&final_mutex);
 32 }
 33 
 34 void count_unregister_thread(int nthreadsexpected)
 35 {
 36   int idx = smp_thread_id();
 37 
 38   spin_lock(&final_mutex);
 39   finalcount += counter;
 40   counterp[idx] = NULL;
 41   spin_unlock(&final_mutex);
 42 }
\end{verbatim}
}
\caption{Per-Thread Statistical Counters}
\label{fig:count:Per-Thread Statistical Counters}
\end{figure}

Fortunately, gcc provides an \co{__thread} storage class that provides
per-thread storage.
This can be used as shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters} (\url{count_end.c})
to implement
a statistical counter that not only scales, but that also incurs little
or no performance penalty to incrementers compared to simple non-atomic
increment.

Lines~1-4 define needed variables: \co{counter} is the per-thread counter
variable, the \co{counterp[]} array allows threads to access each others'
counters, finalcount accumulates the total as individual threads exit,
and \co{final_mutex} coordinates between threads accumulating the total
value of the counter and exiting threads.

\QuickQuiz{}
	Why do we need an explicit array to find the other threads'
	counters?
	Why doesn't gcc provide a \co{per_thread()} interface, similar
	to the Linux kernel's \co{per_cpu()} primitive, to allow
	threads to more easily access each others' per-thread variables?
\QuickQuizAnswer{
	Why indeed?

	To be fair, gcc faces some challenges that the Linux kernel
	gets to ignore.
	When a user-level thread exits, its per-thread variables all
	disappear, which complicates the problem of per-thread-variable
	access, particularly before the advent of user-level RCU
	(see Section~\ref{sec:defer:Read-Copy Update (RCU)}).
	In contrast, in the Linux kernel, when a CPU goes offline,
	that CPU's per-CPU variables remain mapped and accessible.

	Similarly, when a new user-level thread is created, its
	per-thread variables suddenly come into existence.
	In contrast, in the Linux kernel, all per-CPU variables are
	mapped and initialized at boot time, regardless of whether
	the corresponding CPU exists yet, or indeed, whether the
	corresponding CPU will ever exist.

	A key limitation that the Linux kernel imposes is a compile-time
	maximum bound on the number of CPUs, namely, \co{CONFIG_NR_CPUS},
	along with a typically tighter boot-time bound of \co{nr_cpu_ids}.
	In contrast, in user space, there is no hard-coded upper limit
	on the number of threads.

	Of course, both environments must handle dynamically loaded
	code (dynamic libraries in user space, kernel modules in the
	Linux kernel), which increases the complexity of per-thread
	variables.

	These complications make it significantly harder for user-space
	environments to provide access to other threads' per-thread
	variables.
	Nevertheless, such access is highly useful, and it is hoped
	that it will someday appear.
} \QuickQuizEnd

The \co{inc_count()} function used by updaters is quite simple, as can
be seen on lines~6-9.

The \co{read_count()} function used by readers is a bit more complex.
Line~16 acquires a lock to exclude exiting threads, and line~21 releases
it.
Line~17 initializes the sum to the count accumulated by those threads that
have already exited, and lines~18-20 sum the counts being accumulated
by threads currently running.
Finally, line~22 returns the sum.

\QuickQuiz{}
	Why on earth do we need something as heavyweight as a \emph{lock}
	guarding the summation in the function \co{read_count()} in
	Figure~\ref{fig:count:Per-Thread Statistical Counters}?
\QuickQuizAnswer{
	Remember, when a thread exits, its per-thread variables disappear.
	Therefore, if we attempt to access a given thread's per-thread
	variables after that thread exits, we will get a segmentation
	fault.
	The lock coordinates summation and thread exit, preventing this
	scenario.

	Of course, we could instead read-acquire a reader-writer lock,
	but Chapter~\ref{chp:Deferred Processing} will introduce even
	lighter-weight mechanisms for implementing the required coordination.
} \QuickQuizEnd

Lines~25-32 show the \co{count_register_thread()} function, which
must be called by each thread before its first use of this counter.
This function simply sets up this thread's element of the \co{counterp[]}
array to point to its per-thread \co{counter} variable.

\QuickQuiz{}
	Why on earth do we need to acquire the lock in
	\co{count_register_thread()} in
	Figure~\ref{fig:count:Per-Thread Statistical Counters}?
	It is a single properly aligned machine-word store to a location
	that no other thread is modifying, so it should be atomic anyway,
	right?
\QuickQuizAnswer{
	This lock could in fact be omitted, but better safe than
	sorry, especially given that this function is executed only at
	thread startup, and is therefore not on any critical path.
	Now, if we were testing on machines with thousands of CPUs,
	we might need to omit the lock, but on machines with ``only''
	a hundred or so CPUs, there is no need to get fancy.
} \QuickQuizEnd

Lines~34-42 show the \co{count_unregister_thread()} function, which
must be called prior to exit by each thread that previously called
\co{count_register_thread()}.
Line~38 acquires the lock, and line~41 releases it, thus excluding any
calls to \co{read_count()} as well as other calls to
\co{count_unregister_thread()}.
Line~39 adds this thread's \co{counter} to the global \co{finalcount},
and then line~40 \co{NULL}s out its \co{counterp[]} array entry.
A subsequent call to \co{read_count()} will see the exiting thread's
count in the global \co{finalcount}, and will skip the exiting thread
when sequencing through the \co{counterp[]} array, thus obtaining
the correct total.

This approach gives updaters almost exactly the same performance as
a non-atomic add, and also scales linearly.
On the other hand, concurrent reads contend for a single global lock,
and therefore perform poorly and scale abysmally.
However, this is not a problem for statistical counters, where incrementing
happens often and readout happens almost never.
Of course, this approach is considerably more complex than the
array-based scheme, due to the fact that a given thread's per-thread
variables vanish when that thread exits.

\QuickQuiz{}
	Fine, but the Linux kernel doesn't have to acquire a lock
	when reading out the aggregate value of per-CPU counters.
	So why should user-space code need to do this???
\QuickQuizAnswer{
	Remember, the Linux kernel's per-CPU variables are always
	accessible, even if the corresponding CPU is offline --- even
	if the corresponding CPU never existed and never will exist.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 long __thread counter = 0;
  2 long *counterp[NR_THREADS] = { NULL };
  3 int finalthreadcount = 0;
  4 DEFINE_SPINLOCK(final_mutex);
  5 
  6 void inc_count(void)
  7 {
  8   counter++;
  9 }
 10 
 11 long read_count(void)
 12 {
 13   int t;
 14   long sum = 0;
 15 
 16   for_each_thread(t)
 17     if (counterp[t] != NULL)
 18       sum += *counterp[t];
 19   return sum;
 20 }
 21 
 22 void count_init(void)
 23 {
 24 }
 25 
 26 void count_register_thread(void)
 27 {
 28   counterp[smp_thread_id()] = &counter;
 29 }
 30 
 31 void count_unregister_thread(int nthreadsexpected)
 32 {
 33   spin_lock(&final_mutex);
 34   finalthreadcount++;
 35   spin_unlock(&final_mutex);
 36   while (finalthreadcount < nthreadsexpected)
 37     poll(NULL, 0, 1);
 38 }
\end{verbatim}
}
\caption{Per-Thread Statistical Counters With Lockless Summation}
\label{fig:count:Per-Thread Statistical Counters With Lockless Summation}
\end{figure}

	One workaround is to ensure that each thread continues to exist
	until all threads are finished, as shown in
	Figure~\ref{fig:count:Per-Thread Statistical Counters With Lockless Summation}
	(\co{count_tstat.c}).
	Analysis of this code is left as an exercise to the reader,
	however, please note that it does not fit well into the
	\url{counttorture.h} counter-evaluation scheme.
	(Why not?)
	Chapter~\ref{chp:Deferred Processing} will introduce 
	synchronization mechanisms that handle this situation in a much
	more graceful manner.
} \QuickQuizEnd

\subsection{Discussion}

These three implementations show that it is possible to obtain uniprocessor
performance for statistical counters, despite running on a parallel
machine.

\QuickQuiz{}
	What fundamental difference is there between counting packets
	and counting the total number of bytes in the packets, given
	that the packets vary in size?
\QuickQuizAnswer{
	When counting packets, the counter is only incremented by the
	value one.
	On the other hand, when counting bytes, the counter might
	be incremented by largish numbers.

	Why does this matter?
	Because in the increment-by-one case, the value returned will
	be exact in the sense that the counter must necessarily have
	taken on that value at some point in time, even if it is impossible
	to say precisely when that point occurred.
	In contrast, when counting bytes, two different threads might
	return values that are inconsistent with any global ordering
	of operations.

	To see this, suppose that thread~0 adds the value three to its
	counter, thread~1 adds the value five to its counter, and
	threads~2 and 3 sum the counters.
	If the system is ``weakly ordered'' or if the compiler
	uses aggressive optimizations, thread~2 might find the
	sum to be three and thread~3 might find the sum to be five.
	The only possible global orders of the sequence of values
	of the counter are 0,3,8 and 0,5,8, and neither order is
	consistent with the results obtained.

	If you missed this one, you are not alone.
	Michael Scott used this question to stump Paul E.~McKenney
	during Paul's Ph.D. defense.
} \QuickQuizEnd

\QuickQuiz{}
	Given that the reader must sum all the threads' counters,
	this could take a long time given large numbers of threads.
	Is there any way that the increment operation can remain
	fast and scalable while allowing readers to also enjoy
	reasonable performance and scalability?
\QuickQuizAnswer{
	One approach would be to maintain a global approximation
	to the value.
	Readers would increment their per-thread variable, but when it
	reached some predefined limit, atomically add it to a global
	variable, then zero their per-thread variable.
	This would permit a tradeoff between average increment overhead
	and accuracy of the value read out.

	The reader is encouraged to think up and try out other approaches,
	for example, using a combining tree.
} \QuickQuizEnd

Given what has been presented in this section, you should now be able
to answer the Quick Quiz about statistical counters for networking
near the beginning of this chapter.

\section{Approximate Limit Counters}
\label{sec:count:Approximate Limit Counters}

Another special case of counting involves limit-checking.
For example, as noted in the approximate structure-allocation limit
problem in the Quick Quiz on
page~\pageref{chp:Counting},
suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations once the number
of structures in use exceeds a limit, in this case, 10,000.
Suppose further that these structures are short-lived, that this
limit is rarely exceeded, and that this limit is approximate in
that it is OK to exceed it sometimes by some bounded amount
(see Section~\ref{sec:count:Exact Limit Counters}
if you instead need the limit to be exact).

\subsection{Design}

One possible design for limit counters is to divide the limit of 10,000
by the number of threads, and give each thread a fixed pool of structures.
For example, given 100 threads, each thread would manage its own pool
of 100 structures.
This approach is simple, and in some cases works well, but it does not
handle the common case where a given structure is allocated by one
thread and freed by another~\cite{McKenney93}.
On the one hand, if a given thread takes credit for any structures it
frees, then the thread doing most of the allocating runs out
of structures, while the threads doing most of the freeing have lots
of credits that they cannot use.
On the other hand, if freed structures are credited to the CPU that
allocated them, it will be necessary for CPUs to manipulate each
others' counters, which will require expensive atomic instructions
or other means of communicating between threads.\footnote{
	That said, if each structure will always be freed
	by the same CPU (or thread) that allocated it, then
	this simple partitioning approach works extremely well.}

In short, for many important workloads, we cannot fully partition the counter.
Given that partitioning the counters was what brought the excellent
update-side performance for the three schemes discussed in
Section~\ref{sec:count:Statistical Counters}, this might be grounds
for some pessimism.
However, the eventually consistent algorithm presented in
Section~\ref{sec:count:Eventually Consistent Implementation}
provides an interesting hint.
Recall that this algorithm kept two sets of books, a
per-thread \co{counter} variable for updaters and a \co{global_count}
variable for readers, with an \co{eventual()} thread that periodically
updated \co{global_count} to be eventually consistent with the values
of the per-thread \co{counter}.
The per-thread \co{counter} perfectly partitioned the counter value, while
\co{global_count} kept the full value.

For limit counters, we can use a variation on this theme, in that
we \emph{partially partition} the counter.
For example, each of four threads could have a per-thread
\co{counter}, but each could also have a per-thread maximum value
(call it \co{countermax}).

But then what happens if a given thread needs to increment its
\co{counter}, but \co{counter} is equal to its \co{countermax}?
The trick here is to move half of that thread's \co{counter} value
to a \co{globalcount}, then increment \co{counter}.
For example, if a given thread's \co{counter} and \co{countermax}
variables were both equal to 10, we do the following:

\begin{enumerate}
\item	Acquire a global lock.
\item	Add five to \co{globalcount}.
\item	To balance out the addition, subtract five from this
	thread's \co{counter}.
\item	Release the global lock.
\item	Increment this thread's \co{counter}, resulting in a value of six.
\end{enumerate}

Although this procedure still requires a global lock, that lock need only be
acquired once for every five increment operations, greatly reducing
that lock's level of contention.
We can reduce this contention as low as we wish by increasing the
value of \co{countermax}.
However, the corresponding penalty for increasing the value of
\co{countermax} is reduced accuracy of \co{globalcount}.
To see this, note that on a four-CPU system, if \co{countermax}
is equal to ten, \co{globalcount} will be in error by at most
40 counts.
In contrast, if \co{countermax} is increased to 100, \co{globalcount}
might be in error by as much as 400 counts.

This raises the question of just how much we care about \co{globalcount}'s
deviation from the aggregate value of the counter, where this aggregate value
is the sum of \co{globalcount} and each thread's \co{counter} variable.
The answer to this question depends on how far the aggregate value is
from the counter's limit (call it \co{globalcountmax}).
The larger the difference between these two values, the larger \co{countermax}
can be without risk of exceeding the \co{globalcountmax} limit.
This means that the
value of a given thread's \co{countermax} variable can be set
based on this difference.
When far from the limit, the \co{countermax} per-thread variables are
set to large values to optimize for performance and scalability, while
when close to the limit, these same variables are set to small values
to minimize the error in the checks against the \co{globalcountmax} limit.

This design is an example of \emph{parallel fastpath}, which is an important
design pattern in which the common case executes with no expensive
instructions and no interactions between threads, but where occasional
use is also made of a more conservatively designed
(and higher overhead) global algorithm.
This design pattern is covered in more detail in
Section~\ref{sec:SMPdesign:Parallel Fastpath}.

\subsection{Simple Limit Counter Implementation}
\label{sec:count:Simple Limit Counter Implementation}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 unsigned long __thread counter = 0;
  2 unsigned long __thread countermax = 0;
  3 unsigned long globalcountmax = 10000;
  4 unsigned long globalcount = 0;
  5 unsigned long globalreserve = 0;
  6 unsigned long *counterp[NR_THREADS] = { NULL };
  7 DEFINE_SPINLOCK(gblcnt_mutex);
\end{verbatim}
}
\caption{Simple Limit Counter Variables}
\label{fig:count:Simple Limit Counter Variables}
\end{figure}

\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{count/count_lim}}
\end{center}
\caption{Simple Limit Counter Variable Relationships}
\label{fig:count:Simple Limit Counter Variable Relationships}
\end{figure}

Figure~\ref{fig:count:Simple Limit Counter Variables}
shows both the per-thread and global variables used by this
implementation.
The per-thread \co{counter} and \co{countermax} variables are the
corresponding thread's local counter and the upper bound on that
counter, respectively.
The \co{globalcountmax} variable on line~3 contains the upper
bound for the aggregate counter, and the \co{globalcount} variable
on line~4 is the global counter.
The sum of \co{globalcount} and each thread's \co{counter} gives
the aggregate value of the overall counter.
The \co{globalreserve} variable on line~5 is the sum of all of the
per-thread \co{countermax} variables.
The relationship among these variables is shown by
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}:
\begin{enumerate}
\item	The sum of \co{globalcount} and \co{globalreserve} must
	be less than or equal to \co{globalcountmax}.
\item	The sum of all threads' \co{countermax} values must be
	less than or equal to \co{globalreserve}.
\item	Each thread's \co{counter} must be less than or equal to
	that thread's \co{countermax}.
\end{enumerate}

Each element of the \co{counterp[]} array references the corresponding
thread's \co{counter} variable, and, finally, the \co{gblcnt_mutex}
spinlock guards all of the global variables, in other words, no thread
is permitted to access or modify any of the global variables unless it
has acquired \co{gblcnt_mutex}.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 int add_count(unsigned long delta)
  2 {
  3   if (countermax - counter >= delta) {
  4     counter += delta;
  5     return 1;
  6   }
  7   spin_lock(&gblcnt_mutex);
  8   globalize_count();
  9   if (globalcountmax -
 10       globalcount - globalreserve < delta) {
 11     spin_unlock(&gblcnt_mutex);
 12     return 0;
 13   }
 14   globalcount += delta;
 15   balance_count();
 16   spin_unlock(&gblcnt_mutex);
 17   return 1;
 18 }
 19 
 20 int sub_count(unsigned long delta)
 21 {
 22   if (counter >= delta) {
 23     counter -= delta;
 24     return 1;
 25   }
 26   spin_lock(&gblcnt_mutex);
 27   globalize_count();
 28   if (globalcount < delta) {
 29     spin_unlock(&gblcnt_mutex);
 30     return 0;
 31   }
 32   globalcount -= delta;
 33   balance_count();
 34   spin_unlock(&gblcnt_mutex);
 35   return 1;
 36 }
 37 
 38 unsigned long read_count(void)
 39 {
 40   int t;
 41   unsigned long sum;
 42 
 43   spin_lock(&gblcnt_mutex);
 44   sum = globalcount;
 45   for_each_thread(t)
 46     if (counterp[t] != NULL)
 47       sum += *counterp[t];
 48   spin_unlock(&gblcnt_mutex);
 49   return sum;
 50 }
\end{verbatim}
}
\caption{Simple Limit Counter Add, Subtract, and Read}
\label{fig:count:Simple Limit Counter Add, Subtract, and Read}
\end{figure}

Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}
shows the \co{add_count()}, \co{sub_count()}, and \co{read_count()}
functions (\url{count_lim.c}).


\QuickQuiz{}
	Why does
	Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}
	provide \co{add_count()} and \co{sub_count()} instead of the
	\co{inc_count()} and \co{dec_count()} interfaces show in
	Section~\ref{sec:count:Statistical Counters}?
\QuickQuizAnswer{
	Because structures come in different sizes.
	Of course, a limit counter corresponding to a specific size
	of structure might still be able to use
	\co{inc_count()} and \co{dec_count()}.
} \QuickQuizEnd

Lines~1-18 show \co{add_count()}, which adds the specified value \co{delta}
to the counter.
Line~3 checks to see if there is room for \co{delta} on this thread's
\co{counter}, and, if so, line~4 adds it and line~6 returns success.
This is the \co{add_counter()} fastpath, and it does no atomic operations,
references only per-thread variables, and should not incur any cache misses.

\QuickQuiz{}
	What is with the strange form of the condition on line~3 of
	Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
	Why not the following more intuitive form of the fastpath?

	\vspace{5pt}
	\begin{minipage}[t]{\columnwidth}
	\small
	\begin{verbatim}
  3 if (counter + delta <= countermax){
  4   counter += delta;
  5   return 1;
  6 }
	\end{verbatim}
	\end{minipage}
	\vspace{5pt}
\QuickQuizAnswer{
	Two words.
	``Integer overflow.''

	Try the above formulation with \co{counter} equal to 10 and
	\co{delta} equal to \co{ULONG_MAX}.
	Then try it again with the code shown in
	Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}.

	A good understanding of integer overflow will be required for
	the rest of this example, so if you have never dealt with
	integer overflow before, please try several examples to get
	the hang of it.
	Integer overflow can sometimes be more difficult to get right
	than parallel algorithms!
} \QuickQuizEnd

If the test on line~3 fails, we must access global variables, and thus
must acquire \co{gblcnt_mutex} on line~7, which we release on line~11
in the failure case or on line~16 in the success case.
Line~8 invokes \co{globalize_count()}, shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions},
which clears the thread-local variables, adjusting the global variables
as needed, thus simplifying global processing.
(But don't take \emph{my} word for it, try coding it yourself!)
Lines~9 and 10 check to see if addition of \co{delta} can be accommodated,
with the meaning of the expression preceding the less-than sign shown in
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}
as the difference in height of the two red bars.
If the addition of \co{delta} cannot be accommodated, then
line~11 (as noted earlier) releases \co{gblcnt_mutex} and line~12
returns indicating failure.

Otherwise, we take the slowpath.
Line~14 adds \co{delta} to \co{globalcount}, and then
line~15 invokes \co{balance_count()} (shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions})
in order to update both the global and the per-thread variables.
This call to \co{balance_count()}
will usually set this thread's \co{countermax} to re-enable the fastpath.
Line~16 then releases
\co{gblcnt_mutex} (again, as noted earlier), and, finally,
line~17 returns indicating success.

\QuickQuiz{}
	Why does \co{globalize_count()} zero the per-thread variables,
	only to later call \co{balance_count()} to refill them in
	Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
	Why not just leave the per-thread variables non-zero?
\QuickQuizAnswer{
	That is in fact what an earlier version of this code did.
	But addition and subtraction are extremely cheap, and handling
	all of the special cases that arise is quite complex.
	Again, feel free to try it yourself, but beware of integer
	overflow!
} \QuickQuizEnd

Lines~20-36 show \co{sub_count()}, which subtracts the specified
\co{delta} from the counter.
Line~22 checks to see if the per-thread counter can accommodate
this subtraction, and, if so, line~23 does the subtraction and
line~24 returns success.
These lines form \co{sub_count()}'s fastpath, and, as with
\co{add_count()}, this fastpath executes no costly operations.

If the fastpath cannot accommodate subtraction of \co{delta},
execution proceeds to the slowpath on lines~26-35.
Because the slowpath must access global state, line~26
acquires \co{gblcnt_mutex}, which is released either by line~29
(in case of failure) or by line~34 (in case of success).
Line~27 invokes \co{globalize_count()}, shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions},
which again clears the thread-local variables, adjusting the global variables
as needed.
Line~28 checks to see if the counter can accommodate subtracting
\co{delta}, and, if not, line~29 releases \co{gblcnt_mutex}
(as noted earlier) and line~30 returns failure.

\QuickQuiz{}
	Given that \co{globalreserve} counted against us in \co{add_count()},
	why doesn't it count for us in \co{sub_count()} in
	Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
\QuickQuizAnswer{
	The \co{globalreserve} variable tracks the sum of all threads'
	\co{countermax} variables.
	The sum of these threads' \co{counter} variables might be anywhere
	from zero to \co{globalreserve}.
	We must therefore take a conservative approach, assuming that
	all threads' \co{counter} variables are full in \co{add_count()}
	and that they are all empty in \co{sub_count()}.

	But remember this question, as we will come back to it later.
} \QuickQuizEnd

\QuickQuiz{}
	Suppose that one thread invokes \co{add_count()} shown in
	Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read},
	and then another thread invokes \co{sub_count()}.
	Won't \co{sub_count()} return failure even though the value of
	the counter is non-zero?
\QuickQuizAnswer{
	Indeed it will!
	In many cases, this will be a problem, as discussed in
	Section~\ref{sec:count:Simple Limit Counter Discussion}, and
	in those cases the algorithms from
	Section~\ref{sec:count:Exact Limit Counters}
	will likely be preferable.
} \QuickQuizEnd

If, on the other hand, line~28 finds that the counter \emph{can}
accommodate subtracting \co{delta}, we complete the slowpath.
Line~32 does the subtraction and then
line~33 invokes \co{balance_count()} (shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions})
in order to update both global and per-thread variables
(hopefully re-enabling the fastpath).
Then line~34 releases \co{gblcnt_mutex}, and line~35 returns success.

\QuickQuiz{}
	Why have both \co{add_count()} and \co{sub_count()} in
	Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
	Why not simply pass a negative number to \co{add_count()}?
\QuickQuizAnswer{
	Given that \co{add_count()} takes an \co{unsigned} \co{long}
	as its argument, it is going to be a bit tough to pass it a
	negative number.
	And unless you have some anti-matter memory, there is little
	point in allowing negative numbers when counting the number
	of structures in use!
} \QuickQuizEnd

Lines~38-50 show \co{read_count()}, which returns the aggregate value
of the counter.
It acquires \co{gblcnt_mutex} on line~43 and releases it on line 48,
excluding global operations from \co{add_count()} and \co{sub_count()},
and, as we will see, also excluding thread creation and exit.
Line~44 initializes local variable \co{sum} to the value of
\co{globalcount}, and then the loop spanning lines~45-47 sums the
per-thread \co{counter} variables.
Line~49 then returns the sum.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 static void globalize_count(void)
  2 {
  3   globalcount += counter;
  4   counter = 0;
  5   globalreserve -= countermax;
  6   countermax = 0;
  7 }
  8 
  9 static void balance_count(void)
 10 {
 11   countermax = globalcountmax -
 12                globalcount - globalreserve;
 13   countermax /= num_online_threads();
 14   globalreserve += countermax;
 15   counter = countermax / 2;
 16   if (counter > globalcount)
 17     counter = globalcount;
 18   globalcount -= counter;
 19 }
 20 
 21 void count_register_thread(void)
 22 {
 23   int idx = smp_thread_id();
 24 
 25   spin_lock(&gblcnt_mutex);
 26   counterp[idx] = &counter;
 27   spin_unlock(&gblcnt_mutex);
 28 }
 29 
 30 void count_unregister_thread(int nthreadsexpected)
 31 {
 32   int idx = smp_thread_id();
 33 
 34   spin_lock(&gblcnt_mutex);
 35   globalize_count();
 36   counterp[idx] = NULL;
 37   spin_unlock(&gblcnt_mutex);
 38 }
\end{verbatim}
}
\caption{Simple Limit Counter Utility Functions}
\label{fig:count:Simple Limit Counter Utility Functions}
\end{figure}

Figure~\ref{fig:count:Simple Limit Counter Utility Functions}
shows a number of utility functions used by the \co{add_count()}
\co{sub_count()}, and \co{read_count()} primitives shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}.

Lines~1-7 show \co{globalize_count()}, which zeros the current thread's
per-thread counters, adjusting the global variables appropriately.
It is important to note that this function does not change the aggregate
value of the counter, but instead changes how the counter's current value
is represented.
Line~3 adds the thread's \co{counter} variable to \co{globalcount},
and line~4 zeroes \co{counter}.
Similarly, line~5 subtracts the per-thread \co{countermax} from
\co{globalreserve}, and line~6 zeroes \co{countermax}.
It is helpful to refer to
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}
when reading both this function and \co{balance_count()}, which is next.

Lines~9-19 show \co{balance_count()}, which is roughly speaking
the inverse of \co{globalize_count()}.
This function's job is to set the current thread's
\co{countermax} variable to the largest value that avoids the risk
of the counter exceeding the \co{globalcountmax} limit.
Changing the current thread's \co{countermax} variable of course
requires corresponding adjustments to \co{counter}, \co{globalcount}
and \co{globalreserve}, as can be seen by referring back to
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}.
By doing this, \co{balance_count()} maximizes use of
\co{add_count()}'s and \co{sub_count()}'s low-overhead fastpaths.
As with \co{globalize_count()}, \co{balance_count()} is not permitted
to change the aggregate value of the counter.

Lines~11-13 compute this thread's share of that portion of
\co{globalcountmax} that is not already covered by either
\co{globalcount} or \co{globalreserve}, and assign the
computed quantity to this thread's \co{countermax}.
Line~14 makes the corresponding adjustment to \co{globalreserve}.
Line~15 sets this thread's \co{counter} to the middle of the range
from zero to \co{countermax}.
Line~16 checks to see whether \co{globalcount} can in fact accommodate
this value of \co{counter}, and, if not, line~17 decreases \co{counter}
accordingly.
Finally, in either case, line~18 makes the corresponding adjustment to
\co{globalcount}.

\QuickQuiz{}
	Why set \co{counter} to \co{countermax / 2} in line~15 of
	Figure~\ref{fig:count:Simple Limit Counter Utility Functions}?
	Wouldn't it be simpler to just take \co{countermax} counts?
\QuickQuizAnswer{
	First, it really is reserving \co{countermax} counts
	(see line~14), however,
	it adjusts so that only half of these are actually in use
	by the thread at the moment.
	This allows the thread to carry out at least \co{countermax / 2}
	increments or decrements before having to refer back to
	\co{globalcount} again.

	Note that the accounting in \co{globalcount} remains accurate,
	thanks to the adjustment in line~18.
} \QuickQuizEnd

\begin{figure*}[tb]
\begin{center}
\resizebox{5in}{!}{\includegraphics{count/globbal}}
\end{center}
\caption{Schematic of Globalization and Balancing}
\label{fig:count:Schematic of Globalization and Balancing}
\end{figure*}

It is helpful to look at a schematic depicting how the relationship
of the counters changes with the execution of first
\co{globalize_count()} and then \co{balance_count}, as shown in
Figure~\ref{fig:count:Schematic of Globalization and Balancing}.
Time advances from left to right, with the leftmost configuration
roughly that of
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}.
The center configuration shows the relationship of these same counters
after \co{globalize_count()} is executed by thread~0.
As can be seen from the figure, thread~0's \co{counter} (``c~0'' in
the figure) is added to \co{globalcount}, while the value of
\co{globalreserve} is reduced by this same amount.
Both thread~0's \co{counter} and its \co{countermax}
(``cm~0'' in the figure) are reduced to zero.
The other three threads' counters are unchanged.
Note that this change did not affect the overall value of the counter,
as indicated by the bottommost dotted line connecting the leftmost
and center configurations.
In other words, the sum of \co{globalcount} and the four threads'
\co{counter} variables is the same in both configurations.
Similarly, this change did not affect the sum of \co{globalcount} and
\co{globalreserve}, as indicated by the upper dotted line.

The rightmost configuration shows the relationship of these counters
after \co{balance_count()} is executed, again by thread~0.
One-quarter of the remaining count, denoted by the vertical line extending
up from all three configurations, is added to thread~0's
\co{countermax} and half of that to thread~0's \co{counter}.
The amount added to thread~0's \co{counter} is also subtracted from
\co{globalcount} in order to avoid changing the overall value of the
counter (which is again the sum of \co{globalcount} and the three
threads' \co{counter} variables), again as indicated by the lowermost
of the two dotted lines connecting the center and rightmost configurations.
The \co{globalreserve} variable is also adjusted so that this variable
remains equal to the sum of the four threads' \co{countermax}
variables.
Because thread~0's \co{counter} is less than its \co{countermax},
thread~0 can once again increment the counter locally.

\QuickQuiz{}
	In Figure~\ref{fig:count:Schematic of Globalization and Balancing},
	even though a quarter of the remaining count up to the limit is
	assigned to thread~0, only an eighth of the remaining count is
	consumed, as indicated by the uppermost dotted line connecting
	the center and the rightmost configurations.
	Why is that?
\QuickQuizAnswer{
	The reason this happened is that thread~0's \co{counter} was
	set to half of its \co{countermax}.
	Thus, of the quarter assigned to thread~0, half of that quarter
	(one eighth) came from \co{globalcount}, leaving the other half
	(again, one eighth) to come from the remaining count.

	There are two purposes for taking this approach:
	(1)~To allow thread~0 to use the fastpath for decrements as
	well as increments, and
	(2)~To reduce the inaccuracies if all threads are monotonically
	incrementing up towards the limit.
	To see this last point, step through the algorithm and watch
	what it does.
} \QuickQuizEnd

Lines~21-28 show \co{count_register_thread()}, which sets up state for
newly created threads.
This function simply installs
a pointer to the newly created thread's \co{counter} variable into
the corresponding entry of the \co{counterp[]} array under the protection
of \co{gblcnt_mutex}.

Finally, lines~30-38 show \co{count_unregister_thread()}, which tears down
state for a soon-to-be-exiting thread.
Line~34 acquires \co{gblcnt_mutex} and line~37 releases it.
Line~35 invokes \co{globalize_count()} to clear out this thread's
counter state, and line~36 clears this thread's entry in the
\co{counterp[]} array.

\subsection{Simple Limit Counter Discussion}
\label{sec:count:Simple Limit Counter Discussion}

This type of counter is quite fast when aggregate values are near zero,
with some overhead due to the comparison and branch in both
\co{add_count()}'s and \co{sub_count()}'s fastpaths.
However, the use of a per-thread \co{countermax} reserve means that
\co{add_count()} can fail even when
the aggregate value of the counter is nowhere near \co{globalcountmax}.
Similarly, \co{sub_count()} can fail
even when the aggregate value of the counter is nowhere near zero.

In many cases, this is unacceptable.
Even if the \co{globalcountmax} is intended to be an approximate limit,
there is usually a limit to exactly how much approximation can be tolerated.
One way to limit the degree of approximation is to impose an upper limit
on the value of the per-thread \co{countermax} instances.
This task is undertaken in the next section.

\subsection{Approximate Limit Counter Implementation}
\label{sec:count:Approximate Limit Counter Implementation}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 unsigned long __thread counter = 0;
  2 unsigned long __thread countermax = 0;
  3 unsigned long globalcountmax = 10000;
  4 unsigned long globalcount = 0;
  5 unsigned long globalreserve = 0;
  6 unsigned long *counterp[NR_THREADS] = { NULL };
  7 DEFINE_SPINLOCK(gblcnt_mutex);
  8 #define MAX_COUNTERMAX 100
\end{verbatim}
}
\caption{Approximate Limit Counter Variables}
\label{fig:count:Approximate Limit Counter Variables}
\end{figure}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 static void balance_count(void)
  2 {
  3   countermax = globalcountmax -
  4                globalcount - globalreserve;
  5   countermax /= num_online_threads();
  6   if (countermax > MAX_COUNTERMAX)
  7     countermax = MAX_COUNTERMAX;
  8   globalreserve += countermax;
  9   counter = countermax / 2;
 10   if (counter > globalcount)
 11     counter = globalcount;
 12   globalcount -= counter;
 13 }
\end{verbatim}
}
\caption{Approximate Limit Counter Balancing}
\label{fig:count:Approximate Limit Counter Balancing}
\end{figure}

Because this implementation (\url{count_lim_app.c}) is quite similar to
that in the previous section
(Figures~\ref{fig:count:Simple Limit Counter Variables},
\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}, and
\ref{fig:count:Simple Limit Counter Utility Functions}),
only the changes are shown here.
Figure~\ref{fig:count:Approximate Limit Counter Variables}
is identical to
Figure~\ref{fig:count:Simple Limit Counter Variables},
with the addition of \co{MAX_COUNTERMAX}, which sets the maximum
permissible value of the per-thread \co{countermax} variable.

Similarly,
Figure~\ref{fig:count:Approximate Limit Counter Balancing}
is identical to the \co{balance_count()} function in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions},
with the addition of lines~6 and 7, which enforce the
\co{MAX_COUNTERMAX} limit on the per-thread \co{countermax} variable.

\subsection{Approximate Limit Counter Discussion}

These changes greatly reduce the limit inaccuracy seen in the previous version,
but present another problem: any given value of \co{MAX_COUNTERMAX}
will cause a workload-dependent fraction of accesses to fall off the
fastpath.
As the number of threads increase, non-fastpath execution will become both
a performance and a scalability problem.
However, we will defer this problem and turn instead to counters
with exact limits.

\section{Exact Limit Counters}
\label{sec:count:Exact Limit Counters}

To solve the exact structure-allocation limit problem noted in the
Quick Quiz on
page~\pageref{chp:Counting},
we need a limit counter that can tell exactly when its limits are
exceeded.
One way of implementing such a limit counter is to
cause threads that have reserved counts to give them up.
One way to do this is to use atomic instructions.
Of course, atomic instructions will slow down the fastpath, but on the
other hand, it would be silly not to at least give them a try.

\subsection{Atomic Limit Counter Implementation}
\label{sec:count:Atomic Limit Counter Implementation}

Unfortunately,
if one thread is to safely remove counts from another thread,
both threads will need to atomically manipulate that thread's
that thread's \co{counter} and \co{countermax} variables.
The usual way to do this is to combine these two variables into a
single variable,
for example, given a 32-bit variable, using the high-order 16 bits to
represent \co{counter} and the low-order 16 bits to represent
\co{countermax}.

\QuickQuiz{}
	Why is it necessary to atomically manipulate the thread's
	\co{counter} and \co{countermax} variables as a unit?
	Wouldn't it be good enough to atomically manipulate them
	individually?
\QuickQuizAnswer{
	This might well be possible, but great care is required.
	Note that removing \co{counter} without first zeroing
	\co{countermax} could result in the corresponding thread
	increasing \co{counter} immediately after it was zeroed,
	completely negating the effect of zeroing the counter.

	The opposite ordering, namely zeroing \co{countermax} and then
	removing \co{counter}, can also result in a non-zero
	\co{counter}.
	To see this, consider the following sequence of events:

	\begin{enumerate}
	\item	Thread~A fetches its \co{countermax}, and finds that
		it is non-zero.
	\item	Thread~B zeroes Thread~A's \co{countermax}.
	\item	Thread~B removes Thread~A's \co{counter}.
	\item	Thread~A, having found that its \co{countermax}
		is non-zero, proceeds to add to its \co{counter},
		resulting in a non-zero value for \co{counter}.
	\end{enumerate}

	Again, it might well be possible to atomically manipulate
	\co{countermax} and \co{counter} as separate variables,
	but it is clear that great care is required.
	It is also quite likely that doing so will slow down the
	fastpath.

	Exploring these possibilities are left as exercises for
	the reader.
} \QuickQuizEnd

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 atomic_t __thread ctrandmax = ATOMIC_INIT(0);
  2 unsigned long globalcountmax = 10000;
  3 unsigned long globalcount = 0;
  4 unsigned long globalreserve = 0;
  5 atomic_t *counterp[NR_THREADS] = { NULL };
  6 DEFINE_SPINLOCK(gblcnt_mutex);
  7 #define CM_BITS (sizeof(atomic_t) * 4)
  8 #define MAX_COUNTERMAX ((1 << CM_BITS) - 1)
  9 
 10 static void
 11 split_ctrandmax_int(int cami, int *c, int *cm)
 12 {
 13   *c = (cami >> CM_BITS) & MAX_COUNTERMAX;
 14   *cm = cami & MAX_COUNTERMAX;
 15 }
 16 
 17 static void
 18 split_ctrandmax(atomic_t *cam, int *old,
 19                     int *c, int *cm)
 20 {
 21   unsigned int cami = atomic_read(cam);
 22 
 23   *old = cami;
 24   split_ctrandmax_int(cami, c, cm);
 25 }
 26 
 27 static int merge_ctrandmax(int c, int cm)
 28 {
 29   unsigned int cami;
 30 
 31   cami = (c << CM_BITS) | cm;
 32   return ((int)cami);
 33 }
\end{verbatim}
}
\caption{Atomic Limit Counter Variables and Access Functions}
\label{fig:count:Atomic Limit Counter Variables and Access Functions}
\end{figure}

The variables and access functions for a simple atomic limit counter
are shown in
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
(\url{count_lim_atomic.c}).
The \co{counter} and \co{countermax} variables in earlier algorithms
are combined into the single variable \co{ctrandmax} shown on
line~1, with \co{counter} in the upper half and \co{countermax} in
the lower half.
This variable is of type \co{atomic_t}, which has an underlying
representation of \co{int}.

Lines~2-6 show the definitions for \co{globalcountmax}, \co{globalcount},
\co{globalreserve}, \co{counterp}, and \co{gblcnt_mutex}, all of which
take on roles similar to their counterparts in
Figure~\ref{fig:count:Approximate Limit Counter Variables}.
Line~7 defines \co{CM_BITS}, which gives the number of bits in each half
of \co{ctrandmax}, and line~8 defines \co{MAX_COUNTERMAX}, which
gives the maximum value that may be held in either half of
\co{ctrandmax}.

\QuickQuiz{}
	In what way does line~7 of
	Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
	violate the C standard?
\QuickQuizAnswer{
	It assumes eight bits per byte.
	This assumption does hold for all current commodity microprocessors
	that can be easily assembled into shared-memory multiprocessors,
	but certainly does not hold for all computer systems that have
	ever run C code.
	(What could you do instead in order to comply with the C
	standard?  What drawbacks would it have?)
} \QuickQuizEnd

Lines~10-15 show the \co{split_ctrandmax_int()} function, which,
when given the underlying \co{int} from the \co{atomic_t
ctrandmax} variable, splits it into its \co{counter} (\co{c})
and \co{countermax} (\co{cm}) components.
Line~13 isolates the most-significant half of this \co{int},
placing the result as specified by argument \co{c},
and line~14 isolates the least-significant half of this \co{int},
placing the result as specified by argument \co{cm}.

Lines~17-25 show the \co{split_ctrandmax()} function, which
picks up the underlying \co{int} from the specified variable
on line~21, stores it as specified by the \co{old} argument on
line~23, and then invokes \co{split_ctrandmax_int()} to split
it on line~24.

\QuickQuiz{}
	Given that there is only one \co{ctrandmax} variable,
	why bother passing in a pointer to it on line~18 of
	Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}?
\QuickQuizAnswer{
	There is only one \co{ctrandmax} variable \emph{per thread}.
	Later, we will see code that needs to pass other threads'
	\co{ctrandmax} variables to \co{split_ctrandmax()}.
} \QuickQuizEnd

Lines~27-33 show the \co{merge_ctrandmax()} function, which
can be thought of as the inverse of \co{split_ctrandmax()}.
Line~31 merges the \co{counter} and \co{countermax}
values passed in \co{c} and \co{cm}, respectively, and returns
the result.

\QuickQuiz{}
	Why does \co{merge_ctrandmax()} in
	Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
	return an \co{int} rather than storing directly into an
	\co{atomic_t}?
\QuickQuizAnswer{
	Later, we will see that we need the \co{int} return to pass
	to the \co{atomic_cmpxchg()} primitive.
} \QuickQuizEnd

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 int add_count(unsigned long delta)
  2 {
  3   int c;
  4   int cm;
  5   int old;
  6   int new;
  7 
  8   do {
  9     split_ctrandmax(&ctrandmax, &old, &c, &cm);
 10     if (delta > MAX_COUNTERMAX || c + delta > cm)
 11       goto slowpath;
 12     new = merge_ctrandmax(c + delta, cm);
 13   } while (atomic_cmpxchg(&ctrandmax,
 14                           old, new) != old);
 15   return 1;
 16 slowpath:
 17   spin_lock(&gblcnt_mutex);
 18   globalize_count();
 19   if (globalcountmax - globalcount -
 20       globalreserve < delta) {
 21     flush_local_count();
 22     if (globalcountmax - globalcount -
 23         globalreserve < delta) {
 24       spin_unlock(&gblcnt_mutex);
 25       return 0;
 26     }
 27   }
 28   globalcount += delta;
 29   balance_count();
 30   spin_unlock(&gblcnt_mutex);
 31   return 1;
 32 }
 33 
 34 int sub_count(unsigned long delta)
 35 {
 36   int c;
 37   int cm;
 38   int old;
 39   int new;
 40 
 41   do {
 42     split_ctrandmax(&ctrandmax, &old, &c, &cm);
 43     if (delta > c)
 44       goto slowpath;
 45     new = merge_ctrandmax(c - delta, cm);
 46   } while (atomic_cmpxchg(&ctrandmax,
 47                           old, new) != old);
 48   return 1;
 49 slowpath:
 50   spin_lock(&gblcnt_mutex);
 51   globalize_count();
 52   if (globalcount < delta) {
 53     flush_local_count();
 54     if (globalcount < delta) {
 55       spin_unlock(&gblcnt_mutex);
 56       return 0;
 57     }
 58   }
 59   globalcount -= delta;
 60   balance_count();
 61   spin_unlock(&gblcnt_mutex);
 62   return 1;
 63 }
\end{verbatim}
}
\caption{Atomic Limit Counter Add and Subtract}
\label{fig:count:Atomic Limit Counter Add and Subtract}
\end{figure}

Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
shows the \co{add_count()}, \co{sub_count()}, and
\co{read_count()} functions.

Lines~1-32 show \co{add_count()}, whose fastpath spans lines~8-15,
with the remainder of the function being the slowpath.
Lines~8-14 of the fastpath form a compare-and-swap (CAS) loop, with
the \co{atomic_cmpxchg()} primitives on lines~13-14 performing the
actual CAS.
Line~9 splits the current thread's \co{ctrandmax} variable into its
\co{counter} (in \co{c}) and \co{countermax} (in \co{cm}) components,
while placing the underlying \co{int} into \co{old}.
Line~10 checks whether the amount \co{delta} can be accommodated
locally (taking care to avoid integer overflow), and if not,
line~11 transfers to the slowpath.
Otherwise, line~11 combines an updated \co{counter} value with the
original \co{countermax} value into \co{new}.
The \co{atomic_cmpxchg()} primitive on lines~13-14 then atomically
compares this thread's \co{ctrandmax} variable to \co{old},
updating its value to \co{new} if the comparison succeeds.
If the comparison succeeds, line~15 returns success, otherwise,
execution continues in the loop at line~9.

\QuickQuiz{}
	Yecch!
	Why the ugly \co{goto} on line~11 of
	Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}?
	Haven't you heard of the \co{break} statement???
\QuickQuizAnswer{
	Replacing the \co{goto} with a \co{break} would require keeping
	a flag to determine whether or not line~15 should return, which
	is not the sort of thing you want on a fastpath.
	If you really hate the \co{goto} that much, your best bet would
	be to pull the fastpath into a separate function that returned
	success or failure, with ``failure'' indicating a need for the
	slowpath.
	This is left as an exercise for goto-hating readers.
} \QuickQuizEnd

\QuickQuiz{}
	Why would the \co{atomic_cmpxchg()} primitive at lines~13-14 of
	Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
	ever fail?
	After all, we picked up its old value on line~9 and have not
	changed it!
\QuickQuizAnswer{
	Later, we will see how the \co{flush_local_count()} function in
	Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
	might update this thread's \co{ctrandmax} variable concurrently
	with the execution of the fastpath on lines~8-14 of
	Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}.
} \QuickQuizEnd

Lines~16-31 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
show \co{add_count()}'s slowpath, which is protected by \co{gblcnt_mutex},
which is acquired on line~17 and released on lines~24 and 30.
Line~18 invokes \co{globalize_count()}, which moves this thread's
state to the global counters.
Lines~19-20 check whether the \co{delta} value can be accommodated by
the current global state, and, if not, line~21 invokes
\co{flush_local_count()} to flush all threads' local state to the
global counters, and then lines~22-23 recheck whether \co{delta} can
be accommodated.
If, after all that, the addition of \co{delta} still cannot be accommodated,
then line~24 releases \co{gblcnt_mutex} (as noted earlier), and
then line~25 returns failure.

Otherwise, line~28 adds \co{delta} to the global counter, line~29
spreads counts to the local state if appropriate, line~30 releases
\co{gblcnt_mutex} (again, as noted earlier), and finally, line 31
returns success.

Lines~34-63 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
show \co{sub_count()}, which is structured similarly to
\co{add_count()}, having a fastpath on lines~41-48 and a slowpath on
lines~49-62.
A line-by-line analysis of this function is left as an exercise to
the reader.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 unsigned long read_count(void)
  2 {
  3   int c;
  4   int cm;
  5   int old;
  6   int t;
  7   unsigned long sum;
  8 
  9   spin_lock(&gblcnt_mutex);
 10   sum = globalcount;
 11   for_each_thread(t)
 12     if (counterp[t] != NULL) {
 13       split_ctrandmax(counterp[t], &old, &c, &cm);
 14       sum += c;
 15     }
 16   spin_unlock(&gblcnt_mutex);
 17   return sum;
 18 }
\end{verbatim}
}
\caption{Atomic Limit Counter Read}
\label{fig:count:Atomic Limit Counter Read}
\end{figure}

Figure~\ref{fig:count:Atomic Limit Counter Read} shows \co{read_count()}.
Line~9 acquires \co{gblcnt_mutex} and line~16 releases it.
Line~10 initializes local variable \co{sum} to the value of
\co{globalcount}, and the loop spanning lines~11-15 adds the
per-thread counters to this sum, isolating each per-thread counter
using \co{split_ctrandmax} on line 13.
Finally, line~17 returns the sum.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 static void globalize_count(void)
  2 {
  3   int c;
  4   int cm;
  5   int old;
  6 
  7   split_ctrandmax(&ctrandmax, &old, &c, &cm);
  8   globalcount += c;
  9   globalreserve -= cm;
 10   old = merge_ctrandmax(0, 0);
 11   atomic_set(&ctrandmax, old);
 12 }
 13 
 14 static void flush_local_count(void)
 15 {
 16   int c;
 17   int cm;
 18   int old;
 19   int t;
 20   int zero;
 21 
 22   if (globalreserve == 0)
 23     return;
 24   zero = merge_ctrandmax(0, 0);
 25   for_each_thread(t)
 26     if (counterp[t] != NULL) {
 27       old = atomic_xchg(counterp[t], zero);
 28       split_ctrandmax_int(old, &c, &cm);
 29       globalcount += c;
 30       globalreserve -= cm;
 31     }
 32 }
\end{verbatim}
}
\caption{Atomic Limit Counter Utility Functions 1}
\label{fig:count:Atomic Limit Counter Utility Functions 1}
\end{figure}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 static void balance_count(void)
  2 {
  3   int c;
  4   int cm;
  5   int old;
  6   unsigned long limit;
  7 
  8   limit = globalcountmax - globalcount -
  9           globalreserve;
 10   limit /= num_online_threads();
 11   if (limit > MAX_COUNTERMAX)
 12     cm = MAX_COUNTERMAX;
 13   else
 14     cm = limit;
 15   globalreserve += cm;
 16   c = cm / 2;
 17   if (c > globalcount)
 18     c = globalcount;
 19   globalcount -= c;
 20   old = merge_ctrandmax(c, cm);
 21   atomic_set(&ctrandmax, old);
 22 }
 23 
 24 void count_register_thread(void)
 25 {
 26   int idx = smp_thread_id();
 27 
 28   spin_lock(&gblcnt_mutex);
 29   counterp[idx] = &ctrandmax;
 30   spin_unlock(&gblcnt_mutex);
 31 }
 32 
 33 void count_unregister_thread(int nthreadsexpected)
 34 {
 35   int idx = smp_thread_id();
 36 
 37   spin_lock(&gblcnt_mutex);
 38   globalize_count();
 39   counterp[idx] = NULL;
 40   spin_unlock(&gblcnt_mutex);
 41 }
\end{verbatim}
}
\caption{Atomic Limit Counter Utility Functions 2}
\label{fig:count:Atomic Limit Counter Utility Functions 2}
\end{figure}

Figures~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
and~\ref{fig:count:Atomic Limit Counter Utility Functions 2}
shows the utility functions
\co{globalize_count()},
\co{flush_local_count()},
\co{balance_count()},
\co{count_register_thread()}, and
\co{count_unregister_thread()}.
The code for \co{globalize_count()} is shown on lines~1-12,
of Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1} and
is similar to that of previous algorithms, with the addition of
line~7, which is now required to split out \co{counter} and
\co{countermax} from \co{ctrandmax}.

The code for \co{flush_local_count()}, which moves all threads' local
counter state to the global counter, is shown on lines~14-32.
Line~22 checks to see if the value of \co{globalreserve} permits
any per-thread counts, and, if not, line~23 returns.
Otherwise, line~24 initializes local variable \co{zero} to a combined
zeroed \co{counter} and \co{countermax}.
The loop spanning lines~25-31 sequences through each thread.
Line~26 checks to see if the current thread has counter state,
and, if so, lines~27-30 move that state to the global counters.
Line~27 atomically fetches the current thread's state
while replacing it with zero.
Line~28 splits this state into its \co{counter} (in local variable \co{c})
and \co{countermax} (in local variable \co{cm}) components.
Line~29 adds this thread's \co{counter} to \co{globalcount}, while
line~30 subtracts this thread's \co{countermax} from \co{globalreserve}.

\QuickQuiz{}
	What stops a thread from simply refilling its
	\co{ctrandmax} variable immediately after
	\co{flush_local_count()} on line 14 of
	Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
	empties it?
\QuickQuizAnswer{
	This other thread cannot refill its \co{ctrandmax}
	until the caller of \co{flush_local_count()} releases the
	\co{gblcnt_mutex}.
	By that time, the caller of \co{flush_local_count()} will have
	finished making use of the counts, so there will be no problem
	with this other thread refilling --- assuming that the value
	of \co{globalcount} is large enough to permit a refill.
} \QuickQuizEnd

\QuickQuiz{}
	What prevents concurrent execution of the fastpath of either
	\co{atomic_add()} or \co{atomic_sub()} from interfering with
	the \co{ctrandmax} variable while
	\co{flush_local_count()} is accessing it on line 27 of
	Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
	empties it?
\QuickQuizAnswer{
	Nothing.
	Consider the following three cases:
	\begin{enumerate}
	\item	If \co{flush_local_count()}'s \co{atomic_xchg()} executes
		before the \co{split_ctrandmax()} of either fastpath,
		then the fastpath will see a zero \co{counter} and
		\co{countermax}, and will thus transfer to the slowpath
		(unless of course \co{delta} is zero).
	\item	If \co{flush_local_count()}'s \co{atomic_xchg()} executes
		after the \co{split_ctrandmax()} of either fastpath,
		but before that fastpath's \co{atomic_cmpxchg()},
		then the \co{atomic_cmpxchg()} will fail, causing the
		fastpath to restart, which reduces to case~1 above.
	\item	If \co{flush_local_count()}'s \co{atomic_xchg()} executes
		after the \co{atomic_cmpxchg()} of either fastpath,
		then the fastpath will (most likely) complete successfully
		before \co{flush_local_count()} zeroes the thread's
		\co{ctrandmax} variable.
	\end{enumerate}
	Either way, the race is resolved correctly.
} \QuickQuizEnd

Lines~1-22 on
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 2}
show the code for \co{balance_count()}, which refills
the calling thread's local \co{ctrandmax} variable.
This function is quite similar to that of the preceding algorithms,
with changes required to handle the merged \co{ctrandmax} variable.
Detailed analysis of the code is left as an exercise for the reader,
as it is with the \co{count_register_thread()} function starting on
line~24 and the \co{count_unregister_thread()} function starting on
line~33.

\QuickQuiz{}
	Given that the \co{atomic_set()} primitive does a simple
	store to the specified \co{atomic_t}, how can line~21 of
	\co{balance_count()} in
	Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 2}
	work correctly in face of concurrent \co{flush_local_count()}
	updates to this variable?
\QuickQuizAnswer{
	The caller of both \co{balance_count()} and
	\co{flush_local_count()} hold \co{gblcnt_mutex}, so
	only one may be executing at a given time.
} \QuickQuizEnd

The next section qualitatively evaluates this design.

\subsection{Atomic Limit Counter Discussion}

This is the first implementation that actually allows the counter to
be run all the way to either of its limits, but it does so at the
expense of adding atomic operations to the fastpaths, which slow down
the fastpaths significantly on some systems.
Although some workloads might tolerate this slowdown, it is worthwhile
looking for algorithms with better read-side performance.
One such algorithm uses a signal handler to steal counts from other
threads.
Because signal handlers run in the context of the signaled thread,
atomic operations are not necessary, as shown in the next section.

\QuickQuiz{}
	But signal handlers can be migrated to some other
	CPU while running.
	Doesn't this possibility require that atomic instructions
	and memory barriers are required to reliably communicate
	between a thread and a signal handler that interrupts that
	thread?
\QuickQuizAnswer{
	No.
	If the signal handler is migrated to another CPU, then the
	interrupted thread is also migrated along with it.
} \QuickQuizEnd

\subsection{Signal-Theft Limit Counter Design}
\label{sec:count:Signal-Theft Limit Counter Design}

\begin{figure}[tb]
\begin{center}
\resizebox{2in}{!}{\includegraphics{count/sig-theft}}
\end{center}
\caption{Signal-Theft State Machine}
\label{fig:count:Signal-Theft State Machine}
\end{figure}

Even though per-thread state will now be manipulated only by the
corresponding thread, there will still need to be synchronization
with the signal handlers.
This synchronization is provided by the state machine shown in
Figure~\ref{fig:count:Signal-Theft State Machine}
The state machine starts out in the IDLE state, and when \co{add_count()}
or \co{sub_count()} find that the combination of the local thread's count
and the global count cannot accommodate the request, the corresponding
slowpath sets each thread's \co{theft} state to REQ (unless that thread
has no count, in which case it transitions directly to READY).
Only the slowpath, which holds the \co{gblcnt_mutex} lock, is permitted to
transition from the IDLE state, as indicated by the green color.
The slowpath then sends a signal to each thread, and the corresponding
signal handler checks the corresponding thread's \co{theft} and
\co{counting} variables.
If the \co{theft} state is not REQ, then the signal handler is not
permitted to change the state, and therefore simply returns.
Otherwise, if the \co{counting} variable is set, indicating that
the current thread's fastpath is in progress, the signal handler
sets the \co{theft} state to ACK, otherwise to READY.

If the \co{theft} state is ACK,
only the fastpath is permitted to change
the \co{theft} state, as indicated by the blue color.
When the fastpath completes, it sets the \co{theft} state to READY.

Once the slowpath sees a thread's \co{theft} state is READY, the
slowpath is permitted to steal that thread's count.
The slowpath then sets that thread's \co{theft} state to IDLE.

\QuickQuiz{}
	In Figure~\ref{fig:count:Signal-Theft State Machine}, why is
	the REQ \co{theft} state colored red?
\QuickQuizAnswer{
	To indicate that only the fastpath is permitted to change the
	\co{theft} state, and that if the thread remains in this
	state for too long, the thread running the slowpath will
	resend the POSIX signal.
} \QuickQuizEnd

\QuickQuiz{}
	In Figure~\ref{fig:count:Signal-Theft State Machine}, what is
	the point of having separate REQ and ACK \co{theft} states?
	Why not simplify the state machine by collapsing
	them into a single REQACK state?
	Then whichever of the signal handler or the fastpath gets there
	first could set the state to READY.
\QuickQuizAnswer{
	Reasons why collapsing the REQ and ACK states would be a very
	bad idea include:
	\begin{enumerate}
	\item	The slowpath uses the REQ and ACK states to determine
		whether the signal should be retransmitted.
		If the states were collapsed, the slowpath would have
		no choice but to send redundant signals, which would
		have the unhelpful effect of needlessly slowing down
		the fastpath.
	\item	The following race would result:
		\begin{enumerate}
		\item	The slowpath sets a given thread's state to REQACK.
		\item	That thread has just finished its fastpath, and
			notes the REQACK state.
		\item	The thread receives the signal, which also notes
			the REQACK state, and, because there is no fastpath
			in effect, sets the state to READY.
		\item	The slowpath notes the READY state, steals the
			count, and sets the state to IDLE, and completes.
		\item	The fastpath sets the state to READY, disabling
			further fastpath execution for this thread.
		\end{enumerate}
		The basic problem here is that the combined REQACK state
		can be referenced by both the signal handler and the
		fastpath.
		The clear separation maintained by the four-state
		setup ensures orderly state transitions.
	\end{enumerate}
	That said, you might well be able to make a three-state setup
	work correctly.
	If you do succeed, compare carefully to the four-state setup.
	Is the three-state solution really preferable, and why or why not?
} \QuickQuizEnd

\subsection{Signal-Theft Limit Counter Implementation}
\label{sec:count:Signal-Theft Limit Counter Implementation}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 #define THEFT_IDLE  0
  2 #define THEFT_REQ   1
  3 #define THEFT_ACK   2
  4 #define THEFT_READY 3
  5 
  6 int __thread theft = THEFT_IDLE;
  7 int __thread counting = 0;
  8 unsigned long __thread counter = 0;
  9 unsigned long __thread countermax = 0;
 10 unsigned long globalcountmax = 10000;
 11 unsigned long globalcount = 0;
 12 unsigned long globalreserve = 0;
 13 unsigned long *counterp[NR_THREADS] = { NULL };
 14 unsigned long *countermaxp[NR_THREADS] = { NULL };
 15 int *theftp[NR_THREADS] = { NULL };
 16 DEFINE_SPINLOCK(gblcnt_mutex);
 17 #define MAX_COUNTERMAX 100
\end{verbatim}
}
\caption{Signal-Theft Limit Counter Data}
\label{fig:count:Signal-Theft Limit Counter Data}
\end{figure}

Figure~\ref{fig:count:Signal-Theft Limit Counter Data}
(\url{count_lim_sig.c})
shows the data structures used by the signal-theft based counter
implementation.
Lines~1-7 define the states and values for the per-thread theft state machine
described in the preceding section.
Lines~8-17 are similar to earlier implementations, with the addition of
lines~14 and 15 to allow remote access to a thread's \co{countermax}
and \co{theft} variables, respectively.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 static void globalize_count(void)
  2 {
  3   globalcount += counter;
  4   counter = 0;
  5   globalreserve -= countermax;
  6   countermax = 0;
  7 }
  8 
  9 static void flush_local_count_sig(int unused)
 10 {
 11   if (ACCESS_ONCE(theft) != THEFT_REQ)
 12     return;
 13   smp_mb();
 14   ACCESS_ONCE(theft) = THEFT_ACK;
 15   if (!counting) {
 16     ACCESS_ONCE(theft) = THEFT_READY;
 17   }
 18   smp_mb();
 19 }
 20 
 21 static void flush_local_count(void)
 22 {
 23   int t;
 24   thread_id_t tid;
 25 
 26   for_each_tid(t, tid)
 27     if (theftp[t] != NULL) {
 28       if (*countermaxp[t] == 0) {
 29         ACCESS_ONCE(*theftp[t]) = THEFT_READY;
 30         continue;
 31       }
 32       ACCESS_ONCE(*theftp[t]) = THEFT_REQ;
 33       pthread_kill(tid, SIGUSR1);
 34     }
 35   for_each_tid(t, tid) {
 36     if (theftp[t] == NULL)
 37       continue;
 38     while (ACCESS_ONCE(*theftp[t]) != THEFT_READY) {
 39       poll(NULL, 0, 1);
 40       if (ACCESS_ONCE(*theftp[t]) == THEFT_REQ)
 41         pthread_kill(tid, SIGUSR1);
 42     }
 43     globalcount += *counterp[t];
 44     *counterp[t] = 0;
 45     globalreserve -= *countermaxp[t];
 46     *countermaxp[t] = 0;
 47     ACCESS_ONCE(*theftp[t]) = THEFT_IDLE;
 48   }
 49 }
 50 
 51 static void balance_count(void)
 52 {
 53   countermax = globalcountmax -
 54     globalcount - globalreserve;
 55   countermax /= num_online_threads();
 56   if (countermax > MAX_COUNTERMAX)
 57     countermax = MAX_COUNTERMAX;
 58   globalreserve += countermax;
 59   counter = countermax / 2;
 60   if (counter > globalcount)
 61     counter = globalcount;
 62   globalcount -= counter;
 63 }
\end{verbatim}
}
\caption{Signal-Theft Limit Counter Value-Migration Functions}
\label{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
\end{figure}

Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
shows the functions responsible for migrating counts between per-thread
variables and the global variables.
Lines~1-7 shows \co{global_count()}, which is identical to earlier
implementations.
Lines~9-19 shows \co{flush_local_count_sig()}, which is the signal
handler used in the theft process.
Lines~11 and 12 check to see if the \co{theft} state is REQ, and, if not
returns without change.
Line~13 executes a memory barrier to ensure that the sampling of the
theft variable happens before any change to that variable.
Line~14 sets the \co{theft} state to ACK, and, if line~15 sees that
this thread's fastpaths are not running, line~16 sets the \co{theft}
state to READY.

\QuickQuiz{}
	In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
	function \co{flush_local_count_sig()}, why are there
	\co{ACCESS_ONCE()} wrappers around the uses of the
	\co{theft} per-thread variable?
\QuickQuizAnswer{
	The first one (on line~11) can be argued to be unnecessary.
	The last two (lines~14 and 16) are important.
	If these are removed, the compiler would be within its rights
	to rewrite lines~14-17 as follows:
	\vspace{5pt}
	\begin{minipage}[t]{\columnwidth}
	\small
	\begin{verbatim}
 14   theft = THEFT_READY;
 15   if (counting) {
 16     theft = THEFT_ACK;
 17   }
	\end{verbatim}
	\end{minipage}
	\vspace{5pt}
	This would be fatal, as the slowpath might see the transient
	value of \co{THEFT_READY}, and start stealing before the
	corresponding thread was ready.
} \QuickQuizEnd

Lines~21-49 shows \co{flush_local_count()}, which is called from the
slowpath to flush all threads' local counts.
The loop spanning lines~26-34 advances the \co{theft} state for each
thread that has local count, and also sends that thread a signal.
Line~27 skips any non-existent threads.
Otherwise, line~28 checks to see if the current thread holds any local
count, and, if not, line~29 sets the thread's \co{theft} state to READY
and line~30 skips to the next thread.
Otherwise, line~32 sets the thread's \co{theft} state to REQ and
line~33 sends the thread a signal.

\QuickQuiz{}
	In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
	why is it safe for line~28 to directly access the other thread's
	\co{countermax} variable?
\QuickQuizAnswer{
	Because the other thread is not permitted to change the value
	of its \co{countermax} variable unless it holds the
	\co{gblcnt_mutex} lock.
	But the caller has acquired this lock, so it is not possible
	for the other thread to hold it, and therefore the other thread
	is not permitted to change its \co{countermax} variable.
	We can therefore safely access it --- but not change it.
} \QuickQuizEnd

\QuickQuiz{}
	In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
	why doesn't line~33 check for the current thread sending itself
	a signal?
\QuickQuizAnswer{
	There is no need for an additional check.
	The caller of \co{flush_local_count()} has already invoked
	\co{globalize_count()}, so the check on line~28 will have
	succeeded, skipping the later \co{pthread_kill()}.
} \QuickQuizEnd

\QuickQuiz{}
	The code in
	Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
	works with gcc and POSIX.
	What would be required to make it also conform to the ISO C standard?
\QuickQuizAnswer{
	The \co{theft} variable must be of type \co{sig_atomic_t}
	to guarantee that it can be safely shared between the signal
	handler and the code interrupted by the signal.
} \QuickQuizEnd

The loop spanning lines~35-48 waits until each thread reaches READY state,
then steals that thread's count.
Lines~36-37 skip any non-existent threads, and the loop spanning
lines~38-42 wait until the current thread's \co{theft} state becomes READY.
Line~39 blocks for a millisecond to avoid priority-inversion problems,
and if line~40 determines that the thread's signal has not yet arrived,
line~41 resends the signal.
Execution reaches line~43 when the thread's \co{theft} state becomes
READY, so lines~43-46 do the thieving.
Line~47 then sets the thread's \co{theft} state back to IDLE.

\QuickQuiz{}
	In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}, why does line~41 resend the signal?
\QuickQuizAnswer{
	Because many operating systems over several decades have
	had the property of losing the occasional signal.
	Whether this is a feature or a bug is debatable, but
	irrelevant.
	The obvious symptom from the user's viewpoint will not be
	a kernel bug, but rather a user application hanging.

	\emph{Your} user application hanging!
} \QuickQuizEnd

Lines~51-63 show \co{balance_count()}, which is similar to that of
earlier examples.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 int add_count(unsigned long delta)
  2 {
  3   int fastpath = 0;
  4 
  5   counting = 1;
  6   barrier();
  7   if (countermax - counter >= delta &&
  8       ACCESS_ONCE(theft) <= THEFT_REQ) {
  9     counter += delta;
 10     fastpath = 1;
 11   }
 12   barrier();
 13   counting = 0;
 14   barrier();
 15   if (ACCESS_ONCE(theft) == THEFT_ACK) {
 16     smp_mb();
 17     ACCESS_ONCE(theft) = THEFT_READY;
 18   }
 19   if (fastpath)
 20     return 1;
 21   spin_lock(&gblcnt_mutex);
 22   globalize_count();
 23   if (globalcountmax - globalcount -
 24       globalreserve < delta) {
 25     flush_local_count();
 26     if (globalcountmax - globalcount -
 27         globalreserve < delta) {
 28       spin_unlock(&gblcnt_mutex);
 29       return 0;
 30     }
 31   }
 32   globalcount += delta;
 33   balance_count();
 34   spin_unlock(&gblcnt_mutex);
 35   return 1;
 36 }
\end{verbatim}
}
\caption{Signal-Theft Limit Counter Add Function}
\label{fig:count:Signal-Theft Limit Counter Add Function}
\end{figure}

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 38 int sub_count(unsigned long delta)
 39 {
 40   int fastpath = 0;
 41 
 42   counting = 1;
 43   barrier();
 44   if (counter >= delta &&
 45       ACCESS_ONCE(theft) <= THEFT_REQ) {
 46     counter -= delta;
 47     fastpath = 1;
 48   }
 49   barrier();
 50   counting = 0;
 51   barrier();
 52   if (ACCESS_ONCE(theft) == THEFT_ACK) {
 53     smp_mb();
 54     ACCESS_ONCE(theft) = THEFT_READY;
 55   }
 56   if (fastpath)
 57     return 1;
 58   spin_lock(&gblcnt_mutex);
 59   globalize_count();
 60   if (globalcount < delta) {
 61     flush_local_count();
 62     if (globalcount < delta) {
 63       spin_unlock(&gblcnt_mutex);
 64       return 0;
 65     }
 66   }
 67   globalcount -= delta;
 68   balance_count();
 69   spin_unlock(&gblcnt_mutex);
 70   return 1;
 71 }
\end{verbatim}
}
\caption{Signal-Theft Limit Counter Subtract Function}
\label{fig:count:Signal-Theft Limit Counter Subtract Function}
\end{figure}

Figure~\ref{fig:count:Signal-Theft Limit Counter Add Function}
shows the \co{add_count()} function.
The fastpath spans lines~5-20, and the slowpath lines~21-35.
Line~5 sets the per-thread \co{counting} variable to 1 so that
any subsequent signal handlers interrupting this thread will
set the \co{theft} state to ACK rather than READY, allowing this
fastpath to complete properly.
Line~6 prevents the compiler from reordering any of the fastpath body
to precede the setting of \co{counting}.
Lines~7 and 8 check to see if the per-thread data can accommodate
the \co{add_count()} and if there is no ongoing theft in progress,
and if so line~9 does the fastpath addition and line~10 notes that
the fastpath was taken.

In either case, line~12 prevents the compiler from reordering the
fastpath body to follow line~13, which permits any subsequent signal
handlers to undertake theft.
Line~14 again disables compiler reordering, and then line~15
checks to see if the signal handler deferred the \co{theft}
state-change to READY, and, if so, line~16 executes a memory
barrier to ensure that any CPU that sees line~17 setting state to
READY also sees the effects of line~9.
If the fastpath addition at line~9 was executed, then line~20 returns
success.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 unsigned long read_count(void)
  2 {
  3   int t;
  4   unsigned long sum;
  5 
  6   spin_lock(&gblcnt_mutex);
  7   sum = globalcount;
  8   for_each_thread(t)
  9     if (counterp[t] != NULL)
 10       sum += *counterp[t];
 11   spin_unlock(&gblcnt_mutex);
 12   return sum;
 13 }
\end{verbatim}
}
\caption{Signal-Theft Limit Counter Read Function}
\label{fig:count:Signal-Theft Limit Counter Read Function}
\end{figure}

Otherwise, we fall through to the slowpath starting at line~21.
The structure of the slowpath is similar to those of earlier examples,
so its analysis is left as an exercise to the reader.
Similarly, the structure of \co{sub_count()} on
Figure~\ref{fig:count:Signal-Theft Limit Counter Subtract Function}
is the same
as that of \co{add_count()}, so the analysis of \co{sub_count()} is also
left as an exercise for the reader, as is the analysis of
\co{read_count()} in
Figure~\ref{fig:count:Signal-Theft Limit Counter Read Function}.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
  1 void count_init(void)
  2 {
  3   struct sigaction sa;
  4 
  5   sa.sa_handler = flush_local_count_sig;
  6   sigemptyset(&sa.sa_mask);
  7   sa.sa_flags = 0;
  8   if (sigaction(SIGUSR1, &sa, NULL) != 0) {
  9     perror("sigaction");
 10     exit(-1);
 11   }
 12 }
 13 
 14 void count_register_thread(void)
 15 {
 16   int idx = smp_thread_id();
 17 
 18   spin_lock(&gblcnt_mutex);
 19   counterp[idx] = &counter;
 20   countermaxp[idx] = &countermax;
 21   theftp[idx] = &theft;
 22   spin_unlock(&gblcnt_mutex);
 23 }
 24 
 25 void count_unregister_thread(int nthreadsexpected)
 26 {
 27   int idx = smp_thread_id();
 28 
 29   spin_lock(&gblcnt_mutex);
 30   globalize_count();
 31   counterp[idx] = NULL;
 32   countermaxp[idx] = NULL;
 33   theftp[idx] = NULL;
 34   spin_unlock(&gblcnt_mutex);
 35 }
\end{verbatim}
}
\caption{Signal-Theft Limit Counter Initialization Functions}
\label{fig:count:Signal-Theft Limit Counter Initialization Functions}
\end{figure}

Lines~1-12 of
Figure~\ref{fig:count:Signal-Theft Limit Counter Initialization Functions}
show \co{count_init()}, which set up \co{flush_local_count_sig()}
as the signal handler for \co{SIGUSR1},
enabling the \co{pthread_kill()} calls in \co{flush_local_count()}
to invoke \co{flush_local_count_sig()}.
The code for thread registry and unregistry is similar to that of
earlier examples, so its analysis is left as an exercise for the
reader.

\subsection{Signal-Theft Limit Counter Discussion}

The signal-theft implementation runs more than twice as fast as the
atomic implementation on my Intel Core Duo laptop.
Is it always preferable?

The signal-theft implementation would be vastly preferable on Pentium-4
systems, given their slow atomic instructions, but the old 80386-based
Sequent Symmetry systems would do much better with the shorter path
length of the atomic implementation.
However, this increased update-side performance comes at the
prices of higher read-side overhead: Those POSIX signals are not free.
If ultimate performance is of the essence, you will need to measure
them both on the system that your application is to be deployed on.

\QuickQuiz{}
	Not only are POSIX signals slow, sending one to each thread
	simply does not scale.
	What would you do if you had (say) 10,000 threads and needed
	the read side to be fast?
\QuickQuizAnswer{
	One approach is to use the techniques shown in
	Section~\ref{sec:count:Eventually Consistent Implementation},
	summarizing an approximation to the overall counter value in
	a single variable.
	Another approach would be to use multiple threads to carry
	out the reads, with each such thread interacting with a
	specific subset of the updating threads.
} \QuickQuizEnd

This is but one reason why high-quality APIs are so important:
they permit implementations to be changed as required by ever-changing
hardware performance characteristics.

\QuickQuiz{}
	What if you want an exact limit counter to be exact only for
	its lower limit?
\QuickQuizAnswer{
	One simple solution is to overstate the upper limit by the
	desired amount.
	The limiting case of such overstatement results in the
	upper limit being set to the largest value that the counter is
	capable of representing.
} \QuickQuizEnd

\section{Applying Specialized Parallel Counters}
\label{sec:count:Applying Specialized Parallel Counters}

Although the exact limit counter implementations in
Section~\ref{sec:count:Exact Limit Counters}
can be very useful, they are not much help if the counter's value
remains near zero at all times, as it might when counting the number
of outstanding accesses to an I/O device.
The high overhead of such near-zero counting is especially painful
given that we normally don't care how many references there are.
As noted in the removable I/O device access-count problem on
page~\pageref{chp:Counting},
the number of accesses is irrelevant except in those rare cases when
someone is actually trying to remove the device.

One simple solution to this problem is to add a large ``bias''
(for example, one billion) to the
counter in order to ensure that the value is far enough from zero that
the counter can operate efficiently.
When someone wants to remove the device, this bias is subtracted from
the counter value.
Counting the last few accesses will be quite inefficient,
but the important point is that the many prior accesses will have been
counted at full speed.

\QuickQuiz{}
	What else had you better have done when using a biased counter?
\QuickQuizAnswer{
	You had better have set the upper limit to be large enough
	accommodate the bias, the expected maximum number of accesses,
	and enough ``slop'' to allow the counter to work efficiently
	even when the number of accesses is at its maximum.
} \QuickQuizEnd

Although a biased counter can be quite helpful and useful, it is only a
partial solution to the removable I/O device access-count problem
called out on
page~\pageref{chp:Counting}.
When attempting to remove a device, we must not only know the precise
number of current I/O accesses, we also need to prevent any future
accesses from starting.
One way to accomplish this is to read-acquire a reader-writer lock
when updating the counter, and to write-acquire that same reader-writer
lock when checking the counter.
Code for doing I/O might be as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
  1 read_lock(&mylock);
  2 if (removing) {
  3   read_unlock(&mylock);
  4   cancel_io();
  5 } else {
  6   add_count(1);
  7   read_unlock(&mylock);
  8   do_io();
  9   sub_count(1);
 10 }
\end{verbatim}
\end{minipage}
\vspace{5pt}

Line~1 read-acquires the lock, and either line~3 or 7 releases it.
Line~2 checks to see if the device is being removed, and, if so,
line~3 releases the lock and line~4 cancels the I/O, or takes whatever
action is appropriate given that the device is to be removed.
Otherwise, line~6 increments the access count, line~7 releases the
lock, line~8 performs the I/O, and line~9 decrements the access count.

\QuickQuiz{}
	This is ridiculous!
	We are \emph{read}-acquiring a reader-writer lock to
	\emph{update} the counter?
	What are you playing at???
\QuickQuizAnswer{
	Strange, perhaps, but true!
	Almost enough to make you think that the name
	``reader-writer lock'' was poorly chosen, isn't it?
} \QuickQuizEnd

The code to remove the device might be as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
  1 write_lock(&mylock);
  2 removing = 1;
  3 sub_count(mybias);
  4 write_unlock(&mylock);
  5 while (read_count() != 0) {
  6   poll(NULL, 0, 1);
  7 }
  8 remove_device();
\end{verbatim}
\end{minipage}
\vspace{5pt}

Line~1 write-acquires the lock and line~4 releases it.
Line~2 notes that the device is being removed, and the loop spanning
lines~5-7 wait for any I/O operations to complete.
Finally, line~8 does any additional processing needed to prepare for
device removal.

\QuickQuiz{}
	What other issues would need to be accounted for in a real system?
\QuickQuizAnswer{
	A huge number!

	Here are a few to start with:

	\begin{enumerate}
	\item	There could be any number of devices, so that the
		global variables are inappropriate, as are the
		lack of arguments to functions like \co{do_io()}.
	\item	Polling loops can be problematic in real systems.
		In many cases, it is far better to have the last
		completing I/O wake up the device-removal thread.
	\item	The I/O might fail, and so \co{do_io()} will likely
		need a return value.
	\item	If the device fails, the last I/O might never complete.
		In such cases, there might need to be some sort of
		timeout to allow error recovery.
	\item	Both \co{add_count()} and \co{sub_count()} can
		fail, but their return values are not checked.
	\item	Reader-writer locks do not scale well.
		One way of avoiding the high read-acquisition costs
		of reader-writer locks is presented in
		Chapters~\ref{chp:Locking}
		and~\ref{chp:Deferred Processing}.
	\item	The polling loops result in poor energy efficiency.
		An event-driven design is preferable.
	\end{enumerate}
} \QuickQuizEnd

\section{Parallel Counting Discussion}
\label{sec:count:Parallel Counting Discussion}

This chapter has presented the reliability, performance, and
scalability problems with traditional counting primitives.
The C-language \co{++} operator is not guaranteed to function reliably in
multithreaded code, and atomic operations to a single variable neither
perform nor scale well.
This chapter has also presented a number of counting algorithms that
perform and scale extremely well in certain special cases.

\begin{table*}
\begin{center}
\begin{tabular}{l|r|r|r|r}
	& & & \multicolumn{2}{|c}{Reads} \\
	\cline{4-5}
	Algorithm & Section & Updates & 1 Core & 32 Cores \\
	\hline
	\hline
	\url{count_stat.c} & \ref{sec:count:Array-Based Implementation} &
		11.5 ns & 408 ns & 409 ns \\
	\url{count_stat_eventual.c} & \ref{sec:count:Eventually Consistent Implementation} &
		11.6 ns & 1 ns & 1 ns \\
	\url{count_end.c} & \ref{sec:count:Per-Thread-Variable-Based Implementation} &
		6.3 ns & 389 ns & 51,200 ns \\
	\url{count_end_rcu.c} & \ref{sec:applyrcu:RCU and Per-Thread-Variable-Based Statistical Counters} &
		5.7 ns & 354 ns & 501 ns \\
\end{tabular}
\end{center}
\caption{Statistical Counter Performance on Power-6}
\label{tab:count:Statistical Counter Performance on Power-6}
\end{table*}

Table~\ref{tab:count:Statistical Counter Performance on Power-6}
shows the performance of the four parallel statistical counting
algorithms.
All four algorithms provide near-perfect linear scalability for updates.
The per-thread-variable implementation (\co{count_stat.c})
is significantly faster on
updates than the array-based implementation
(\co{count_end.c}), but is slower at reads,
and suffers severe lock contention when there are many parallel readers.
This contention can be addressed using the deferred-processing
techniques introduced in
Chapter~\ref{chp:Deferred Processing},
as shown on the \co{count_end_rcu.c} row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6}.
Deferred processing also shines on the \co{count_stat_eventual.c} row,
courtesy of eventual consistency.

\QuickQuiz{}
	On the \url{count_stat.c} row of
	Table~\ref{tab:count:Statistical Counter Performance on Power-6},
	we see that the update side scales linearly with the number of
	threads.
	How is that possible given that the more threads there are,
	the more per-thread counters must be summed up?
\QuickQuizAnswer{
	The read-side code must scan the entire fixed-size array, regardless
	of the number of threads, so there is no difference in performance.
	In contrast, in the last two algorithms, readers must do more
	work when there are more threads.
	In addition, the last two algorithms interpose an additional
	level of indirection because they map from integer thread ID
	to the corresponding \co{__thread} variable.
} \QuickQuizEnd

\QuickQuiz{}
	Even on the last row of
	Table~\ref{tab:count:Statistical Counter Performance on Power-6},
	the read-side performance of these statistical counter
	implementations is pretty horrible.
	So why bother with them?
\QuickQuizAnswer{
	``Use the right tool for the job.''

	As can be seen from
	Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem},
	single-variable atomic increment need not apply for any job
	involving heavy use of parallel updates.
	In contrast, the algorithms shown in
	Table~\ref{tab:count:Statistical Counter Performance on Power-6}
	do an excellent job of handling update-heavy situations.
	Of course, if you have a read-mostly situation, you should
	use something else, for example, an eventually consistent design
	featuring a single atomically incremented
	variable that can be read out using a single load,
	similar to the approach used in
	Section~\ref{sec:count:Eventually Consistent Implementation}.
} \QuickQuizEnd

\begin{table*}
\begin{center}
\begin{tabular}{l|r|c|r|r|r}
	& & & & \multicolumn{2}{|c}{Reads} \\
	\cline{5-6}
	Algorithm & Section & Exact? & Updates & 1 Core & 64 Cores \\
	\hline
	\hline
	\url{count_lim.c} & \ref{sec:count:Simple Limit Counter Implementation} &
		N & 3.6 ns & 375 ns & 50,700 ns \\
	\url{count_lim_app.c} & \ref{sec:count:Approximate Limit Counter Implementation} &
		N & 11.7 ns & 369 ns & 51,000 ns \\
	\url{count_lim_atomic.c} & \ref{sec:count:Atomic Limit Counter Implementation} &
		Y & 51.4 ns & 427 ns & 49,400 ns \\
	\url{count_lim_sig.c} & \ref{sec:count:Signal-Theft Limit Counter Implementation} &
		Y & 10.2 ns & 370 ns & 54,000 ns \\
\end{tabular}
\end{center}
\caption{Limit Counter Performance on Power-6}
\label{tab:count:Limit Counter Performance on Power-6}
\end{table*}

Figure~\ref{tab:count:Limit Counter Performance on Power-6}
shows the performance of the parallel limit-counting algorithms.
Exact enforcement of the limits incurs a substantial performance
penalty, although on this 4.7GHz Power-6 system that penalty can be reduced
by substituting read-side signals for update-side atomic operations.
All of these implementations suffer from read-side lock contention
in the face of concurrent readers.

\QuickQuiz{}
	Given the performance data shown in
	Table~\ref{tab:count:Limit Counter Performance on Power-6},
	we should always prefer update-side signals over read-side
	atomic operations, right?
\QuickQuizAnswer{
	That depends on the workload.
	Note that you need almost one hundred thousand readers (with roughly
	a 60-nanosecond performance gain) to make up for even one
	writer (with almost a 5-\emph{millisecond} performance loss).
	Although there are no shortage of workloads with far greater
	read intensity, you will need to consider your particular
	workload.

	In addition, although memory barriers have historically been
	expensive compared to ordinary instructions, you should
	check this on the specific hardware you will be running.
	The properties of computer hardware do change over time,
	and algorithms must change accordingly.
} \QuickQuizEnd

\QuickQuiz{}
	Can advanced techniques be applied to address the lock
	contention for readers seen in
	Table~\ref{tab:count:Limit Counter Performance on Power-6}?
\QuickQuizAnswer{
	One approach is to give up some update-side performance, as is
	done with scalable non-zero indicators
	(SNZI)~\cite{FaithEllen:2007:SNZI}.
	There are a number of other ways one might go about this, and these
	are left as exercises for the reader.
	Any number of approaches that apply hierarchy, which replace
	frequent global-lock acquisitions with local lock acquisitions
	corresponding to lower levels of the hierarchy, should work quite well.
} \QuickQuizEnd

The fact that these algorithms only work well in their respective special
cases might be considered a major problem with parallel programming in
general.
After all, the C-language \co{++} operator works just fine in single-threaded
code, and not just for special cases, but in general, right?

This line of reasoning does contain a grain of truth, but is in essence
misguided.
The problem is not parallelism as such, but rather scalability.
To understand this, first consider the C-language \co{++} operator.
The fact is that it does \emph{not} work in general, only for a restricted
range of numbers.
If you need to deal with 1,000-digit decimal numbers, the C-language \co{++}
operator will not work for you.

\QuickQuiz{}
	The \co{++} operator works just fine for 1,000-digit numbers!
	Haven't you heard of operator overloading???
\QuickQuizAnswer{
	In the C++ language, you might well be able to use \co{++}
	on a 1,000-digit number, assuming that you had access to a
	class implementing such numbers.
	But as of 2010, the C language does not permit operator overloading.
} \QuickQuizEnd

This problem is not specific to arithmetic.
Suppose you need to store and query data.
Should you use an ASCII file, XML, a relational database, a linked list,
a dense array, a B-tree, a radix tree, or any of the plethora of other data
structures and environments that permit data to be stored and queried?
It depends on what you need to do, how fast you need it done, and how
large your data set is.

Similarly, if you need to count, your solution will depend on how large
of numbers you need to work with, how many CPUs need to be manipulating
a given number concurrently, how the number is to be used, and what
level of performance and scalability you will need.

Nor is this problem specific to software.
The design for a bridge meant to allow people to walk across a small brook
might be a simple as a single wooden plank.
But you would probably not use a plank to span the kilometers-wide mouth of
the Columbia River, nor would such a design be advisable for bridges
carrying concrete trucks.
In short, just as bridge design must change with increasing span and load,
so must software design change as the number of CPUs increases.

The examples in this chapter have shown that an important tool permitting
large numbers of CPUs to be brought to bear is \emph{partitioning}.
The counters might be fully partitioned, as in the statistical counters
discussed in Section~\ref{sec:count:Statistical Counters},
or partially partitioned as in the limit counters discussed in
Sections~\ref{sec:count:Approximate Limit Counters} and
\ref{sec:count:Exact Limit Counters}.
Partitioning in general will be considered in far greater depth in
Chapter~\ref{cha:Partitioning and Synchronization Design},
and partial parallelization in particular in
Section~\ref{sec:SMPdesign:Parallel Fastpath}, where it is called
\emph{parallel fastpath}.

\QuickQuiz{}
	But if we are going to have to partition everything, why bother
	with shared-memory multithreading?
	Why not just partition the problem completely and run as
	multiple processes, each in its own address space?
\QuickQuizAnswer{
	Indeed, multiple processes with separate address spaces can be
	an excellent way to exploit parallelism, as the proponents of
	the fork-join methodology and the Erlang language would be very
	quick to tell you.
	However, there are also some advantages to shared-memory parallelism:
	\begin{enumerate}
	\item	Only the most performance-critical portions of the
		application must be partitioned, and such portions
		are usually a small fraction of the application.
	\item	Although cache misses are quite slow compared to
		individual register-to-register instructions,
		they are typically considerably faster than
		inter-process-communication primitives, which in
		turn are considerably faster than things like
		TCP/IP networking.
	\item	Shared-memory multiprocessors are readily available
		and quite inexpensive, so, in stark contrast to the
		1990s, there is little cost penalty for use of
		shared-memory parallelism.
	\end{enumerate}
	As always, use the right tool for the job!
} \QuickQuizEnd

The partially partitioned counting algorithms used locking to
guard the global data, and locking is the subject of
Chapter~\ref{chp:Locking}.
In contrast, the partitioned data tended to be fully under the control of
the corresponding thread, so that no synchronization whatsoever was required.
This \emph{data ownership} will be introduced in
Section~\ref{sec:SMPdesign:Data Ownership}
and discussed in more detail in
Chapter~\ref{chp:Data Ownership}.

Finally, the eventually consistent statistical counter discussed in
Section~\ref{sec:count:Eventually Consistent Implementation}
showed how deferring activity (in that case, updating the global
counter) can provide substantial performance and scalability benefits.
Chapter~\ref{chp:Deferred Processing} will examine a number of additional
ways that deferral can improve performance, scalability, and even
real-time response.

In short, as noted at the beginning of this chapter, the simplicity
of the concepts underlying counting have allowed us to explore many
fundamental concurrency issues without the distraction of elaborate
data structures or complex synchronization primitives.
Later chapters will dig more deeply into these issues.
